Symbolic Data Analysis works with variables for which each unit or class of units takes a finite set
of values/categories, an interval or a distribution (a histogram, for instance). When an empirical
distribution corresponds to each observation, we have a histogram-valued variable; it reduces to
an interval-valued variable if each unit takes values on only one interval with probability equal to one.
The Distribution and Symmetric Distribution model is a linear regression model proposed for
histogram-valued variables that may be particularized to interval-valued variables. This model is
defined for n explicative variables and is based on the distributions considered within the intervals.
In this paper we study the special case where the Uniform distribution is assumed in each observed
interval. As in the classical case, a goodness-of-fit measure is deduced from the model. Some
illustrative examples are presented. A simulation study allows discussing interpretations of the
behavior of the model for this variable type.
About 30 years ago Schweizer advocated that "distributions are the numbers of the future". Following
in his footsteps, Diday generalized the classical concept of variables in Multivariate Data Analysis and
introduced Symbolic Data Analysis [ ]. The extensive and complex data that emerged in the last decades
made it necessary to extend and generalize the classical concept of data sets. Data tables where the cells
contain a single quantitative or categorical value were no longer sufficient. More complex data tables were
needed, with cells that include more accurate and complete information. Each cell should express the
variability of the records of each observed unit. These tables are called symbolic data tables [1] and their
cells may contain finite sets of values/categories, intervals or distributions. The corresponding variables
are named symbolic variables. In this case, the objects may be one unit (first-level units) or classes of
units (higher-level units). Symbolic variables can be classified as multi-valued quantitative/qualitative
variables when each unit or class of units takes a finite set of values/categories; interval-valued variables
when the values that the variable takes are intervals; and modal multi-valued variables when a
probability/frequency/weight distribution corresponds to each (first-level or higher-level) unit.
Histogram-valued variables constitute a particular case of this latter kind of symbolic variable, where an
empirical distribution corresponds to each entity under analysis. However, if, for all observations, each
unit takes values on only one interval with probability/frequency one, the histogram-valued variable
reduces to the particular case of an interval-valued variable. Table 1 is an example of a symbolic data
table where the entities under analysis are healthcare centers (higher-level units). This table results from
the aggregation (contemporary aggregation, [2]) of records in a classical data table where the observed
units are the patients (first-level units) of each healthcare center.



Healthcare centers | Gender       | Age     | Number of emergency consults | Waiting time for consult (in minutes)
A                  | {F, …; M, …} | [25,53] | {0,1,2}                      | {[15,30[, 0.25; [30,45[, 0.5; > 60, 0.25}
B                  | {F, …; M, …} | [33,68] | {0,1,4,5,10}                 | {[0,15[, 0.25; [15,30[, 0.25; [30,45[, 0.25; [45,60[, 0.25}
C                  | {F, …; M, …} | [20,75] | {0,1,7,14}                   | {[0,15[, 0.33; [30,45[, 0.33; > 60, 0.33}



Table 1: Symbolic data table with information corresponding to three healthcare centers. 



The symbolic variables in Table 1 are classified as follows: age is an interval-valued variable; number
of emergency consults is a multi-valued quantitative variable; gender and waiting time for consult are
modal-valued variables. The waiting time for consult is, more precisely, a histogram-valued variable.
Alternatively, we could compute and record only the mean, median, maximum or mode of the observed 
values in each healthcare center, but in this case the variability of the data would be lost. 

In other situations we may have multiple records associated to each unit, which may be the result of
several observations performed in one day/month/year. If we want to study such a variable without
summarizing all values in just one value (thereby losing the information about the variability), and if
the observed order is not pertinent, we may aggregate the information referring to one specific period
of time (temporal aggregation, [2]). Each unit (first-level unit) may then be associated to an interval of
values (interval-valued variable) or to a distribution (histogram-valued variable).
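The aggregation step described above can be sketched in a few lines of Python. The patient records, the bin edges and the helper names below are hypothetical and serve only as an illustration; the records were made up so that the resulting age intervals and waiting-time histograms match the entries shown for centers A and B in Table 1.

```python
from collections import Counter

# Hypothetical first-level records: (healthcare center, age, waiting time in minutes).
records = [
    ("A", 25, 20), ("A", 40, 35), ("A", 53, 40), ("A", 31, 70),
    ("B", 33, 10), ("B", 68, 25), ("B", 50, 35), ("B", 47, 50),
]

# Assumed bin edges for the waiting time; the last bin is open-ended.
EDGES = [0, 15, 30, 45, 60]


def to_interval(values):
    """Aggregate a list of real values into the interval [min, max]."""
    return (min(values), max(values))


def to_histogram(values, edges=EDGES):
    """Relative frequency per bin [edges[i], edges[i+1][ (assumes values >= edges[0])."""
    labels = [f"[{lo},{hi}[" for lo, hi in zip(edges, edges[1:])] + [f">={edges[-1]}"]
    counts = Counter()
    for v in values:
        # Number of edges not exceeding v, minus one, is the index of the bin holding v.
        idx = sum(v >= e for e in edges) - 1
        counts[labels[idx]] += 1
    n = len(values)
    return {lab: counts[lab] / n for lab in labels if counts[lab] > 0}


def aggregate(rows):
    """Build one symbolic description (higher-level unit) per healthcare center."""
    table = {}
    for center in sorted({r[0] for r in rows}):
        ages = [r[1] for r in rows if r[0] == center]
        waits = [r[2] for r in rows if r[0] == center]
        table[center] = {"age": to_interval(ages), "waiting_time": to_histogram(waits)}
    return table


symbolic_table = aggregate(records)
```

Keeping the minimum and maximum (or the binned frequencies) instead of a single summary value is what preserves the variability of the first-level records.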

In recent years, statistical concepts and methods for analyzing such symbolic data have been developed 
[7, 6, 5, 4, 3]. Interval-valued variables are the most studied among symbolic variable types. Even though 
distributions are the "numbers of the future", it does not appear simple to work with these elements. 
Typically, concepts and methods for interval-valued variables are defined first, and only then an attempt 
to generalize them to histogram-valued variables is made. This approach is also used because
histogram-valued variables are considered to be a generalization of interval-valued variables. In this
study the approach is different. We will consider interval-valued variables as a particular case of
histogram-valued variables, and we will particularize the linear regression model proposed for
histogram-valued variables, the Distribution and Symmetric Distribution Regression Model [8]. In this
paper, we will consider that the "values" associated to each observation of the explicative and response
interval-valued variables are uniformly distributed across each interval; however, other distributions
may be considered.

In the framework of Symbolic Data Analysis, the linear regression models previously proposed for
interval-valued variables are very different from the one presented here. The most noteworthy are the
Center Method [ ], the MinMax Method [ ], the Center and Range Method [ ] and the Constrained
Center and Range Method [12]. All these methods can predict a response variable from n explicative
variables. However, they do not treat the intervals as such: they fit classical linear regression models
to the lower and upper bounds, or to the centers and half ranges. In other words, these models are
based on the difference between real values and do not quantify the closeness between intervals.
Therefore, the elements estimated by the models may fail to build an interval; to solve this problem,
the most recent model imposes non-negativity constraints on the linear regression between the half
ranges of the intervals [12]. Recently, a Particle Swarm Optimization (PSO) algorithm has been applied
to estimate the parameters of the linear regression models mentioned above, and this new method
provides satisfactory results [ ]. In 2011, Giordani proposed a new approach to linear regression for
interval-valued variables based on the Lasso technique, named the Lasso-IR method [14]. As in the
Center and Range Method and the Constrained Center and Range Method, in this approach the linear
relationship between interval-valued variables considers two regression models, one for the centers
and another for the half ranges. However, in this case the parameters of the two models are related
and, although the model imposes constraints on the linear regression between the half ranges, it does
not impose a direct linear relationship between them. Another limitation of all the linear regression
models referred to above is that no goodness-of-fit measure is deduced from the models. These
limitations, together with the complexity inherent to working with histograms, may prevent a
generalization of the models to histogram-valued variables.

Most linear regression models proposed for interval-valued and histogram-valued variables in the
context of Symbolic Data Analysis are descriptive. The development of non-descriptive methods is
still an open research topic for almost all kinds of symbolic variables. However, some recently
published papers propose probabilistic models for interval-valued variables and present inference
studies (see [18, 17, 16, 15]). Among these works, the research of Lima Neto et al. [17] should be
emphasized. In that study the authors represent an interval-valued variable $Y$ as a bivariate vector
$(Y_1, Y_2)$, where $Y_1$ and $Y_2$ are one-dimensional random variables. The Bivariate Symbolic
Regression Models proposed in that work are a generalization of the theory of regression models. In
this case, the authors assume that the response interval-valued variable belongs to the bivariate
exponential family of distributions. The models proposed by Lima Neto et al. [17] do not have some
of the problems associated to the descriptive models previously proposed: they guarantee that the
upper bound of the estimated interval is always greater than or equal to the lower bound; a
goodness-of-fit measure is deduced; residuals for intervals are defined; and inference techniques are
proposed (residual analysis and diagnostic measures).

Other studies investigate linear regression models for other data where the observations also take the
form of intervals, i.e., imprecise data. It is however important to underline that these data are different
from symbolic data. Although the type of observation is the same, i.e., intervals, their meaning and
the way they are built are different. Imprecise data occur when the interval associated to each unit
under analysis represents the uncertain value associated to the record. For example, they may result
from the measurement of distances or longitudes with imprecise instruments. In this context, "the
intervals are a imprecise perception of real values non observable" [ ]. All linear regression models
defined for imprecise data predict one interval from other intervals using interval arithmetic [ ]. The
use of this arithmetic is probably one of the reasons that makes the generalization of the models to n
explicative variables difficult. The first linear regression models proposed for this kind of element were
simple linear regression models defined in a descriptive context [20, 21]. More recently, developments
for the analysis of imprecise data have been made in an inferential framework. Random intervals, or
interval-valued random sets, are defined as a generalization of real-valued random variables for the
case where the outcomes of a random experiment are described by a compact set instead of a real
number. Some linear regression models between random imprecise elements have been proposed; we
may cite the population Model MRLS [ ] and the more flexible Model M [ ]. However, these models
only allow predicting one response interval-valued variable from one explicative interval-valued
variable and, like the Constrained Center and Range Method, always induce direct linear relationships
between the half ranges of the intervals.

The remainder of the paper is organized as follows. Section 2 introduces the representation of intervals 
by quantile functions and presents the new approach for a linear regression model with interval-valued 
variables. Section 3 reports two simulation studies and discusses their results. In Section 4, some
illustrative applications are presented. Finally, Section 5 concludes the paper, pointing out directions for future
research. 



2 DSD Model for interval-valued variables 

The Distribution and Symmetric Distribution (DSD) Regression Model is a linear regression model for
histogram-valued variables proposed in [ ]. Since interval-valued variables are a particular case of
histogram-valued variables, we may apply the DSD Model to interval-valued variables. The innovations
of the DSD regression model for interval-valued variables that we propose in this paper are as follows. 
First and foremost, the model works with intervals and considers the distribution within the intervals; in 
this paper the Uniform distribution is assumed, but other distributions may also be considered. Then, 
the intervals are represented by quantile functions. Also, the model allows predicting a response variable 
from n explicative variables and the predicted range of values always constitutes an interval; the linear 
relationships between the centers and half ranges induced by the model between the intervals are different 
although related. Furthermore, it is possible to deduce a goodness-of-fit measure from the model. The 
fact that we shall be using a representation of the intervals by quantile functions makes it important to 
make a short introduction to these functions. 



2.1 Quantile functions 

When we have an interval-valued variable $Y$, one "symbolic value" (range of values) corresponds to
each unit $j$; it may be represented by an interval $I_{Y(j)}$ or by the respective quantile function
$\Psi^{-1}_{Y(j)}$.

Definition 2.1 $Y$ is an interval-valued variable when an interval $Y(j)$ of real numbers corresponds
to each unit $j \in \{1, \ldots, m\}$. $Y(j)$ may be represented by the interval [4]:

$$I_{Y(j)} = \left[\underline{I}_{Y(j)}, \overline{I}_{Y(j)}\right]$$

where $\underline{I}_{Y(j)}$ and $\overline{I}_{Y(j)}$ are the lower and upper bounds of the interval
$I_{Y(j)}$, respectively.

It is also possible to represent the interval $Y(j)$ by its center
$c_{Y(j)} = \frac{\overline{I}_{Y(j)} + \underline{I}_{Y(j)}}{2}$ and half-range
$r_{Y(j)} = \frac{\overline{I}_{Y(j)} - \underline{I}_{Y(j)}}{2}$. In this case,

$$I_{Y(j)} = \left[c_{Y(j)} - r_{Y(j)}, \; c_{Y(j)} + r_{Y(j)}\right]$$

Alternatively, considering the distribution within the intervals, they may be represented by quantile
functions. Assuming a Uniform distribution in all intervals $Y(j)$, we may represent each interval
$Y(j)$ by a linear function with domain $[0, 1]$ as follows:

$$\Psi^{-1}_{Y(j)}(t) = \underline{I}_{Y(j)} + \left(\overline{I}_{Y(j)} - \underline{I}_{Y(j)}\right) t, \quad 0 \le t \le 1$$

or, using the center $c_{Y(j)}$ and half-range $r_{Y(j)}$ of the interval, as

$$\Psi^{-1}_{Y(j)}(t) = c_{Y(j)} + r_{Y(j)}(2t - 1), \quad 0 \le t \le 1.$$

The representation of intervals by linear functions was presented by Bertoluzza et al. [24], who termed
it the parametrization of the interval. More recently, particularizing this representation from the
piecewise function that represents histograms, Irpino and Verde [25] named the linear function a
quantile function, that is, the inverse of the cumulative distribution function.

Since in all intervals the lower bound is always less than or equal to the upper bound,
$\underline{I}_{Y(j)} \le \overline{I}_{Y(j)}$, the quantile function that represents an interval is always
a non-decreasing function [ ]. This behavior is illustrated in Example 2.1.

Example 2.1 Consider again the interval-valued variable "Age", $Y_2$ in Table 1, and the respective
intervals corresponding to each of the three healthcare centers. The observed value of this
interval-valued variable $Y_2$ for healthcare center A may be represented by

$$Y_2(A) = [25, 53]$$

or

$$\Psi^{-1}_{Y_2(A)}(t) = 25 + 28t \quad \text{with } 0 \le t \le 1.$$

The quantile functions that represent the intervals of the ages associated to each healthcare center are
represented in Figure 1.

Figure 1: Representation of the quantile functions $\Psi^{-1}_{Y_2(A)}$, $\Psi^{-1}_{Y_2(B)}$, $\Psi^{-1}_{Y_2(C)}$ in Table 1.
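As a small sketch of Example 2.1, the two parametrizations of the quantile function (by bounds and by center and half-range) can be coded and compared on the age intervals of Table 1. The function names are illustrative assumptions, not notation from the paper.

```python
def quantile_from_bounds(lower, upper):
    """Quantile function of [lower, upper] under the Uniform assumption."""
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    return lambda t: lower + (upper - lower) * t


def quantile_from_center(center, half_range):
    """The same linear function written with the center and half-range."""
    return lambda t: center + half_range * (2 * t - 1)


# Age intervals of the three healthcare centers in Table 1.
ages = {"A": (25, 53), "B": (33, 68), "C": (20, 75)}
grid = [i / 10 for i in range(11)]

for name, (lo, hi) in ages.items():
    q_bounds = quantile_from_bounds(lo, hi)
    q_center = quantile_from_center((lo + hi) / 2, (hi - lo) / 2)
    # Both parametrizations agree on the grid.
    assert all(abs(q_bounds(t) - q_center(t)) < 1e-9 for t in grid)
    # The quantile function is non-decreasing.
    assert all(q_bounds(s) <= q_bounds(t) for s, t in zip(grid, grid[1:]))

q_A = quantile_from_bounds(25, 53)
print(q_A(0.0), q_A(0.5), q_A(1.0))  # 25.0 39.0 53.0
```

Evaluating the function at t = 0 and t = 1 returns the bounds of the interval, and t = 0.5 returns its center.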



In the DSD Model proposed in this paper we will work with quantile functions. As these functions
are linear functions with domain $[0, 1]$, we shall use the usual function arithmetic. However, when we
use function arithmetic to operate with quantile functions, problems may arise. Quantile functions are
non-decreasing functions and the sum of quantile functions is a non-decreasing function, but when we
multiply a quantile function by a negative number we obtain a function that is not non-decreasing [ ]
(see Figure 2). So the problem arises of how to obtain the symmetric of the quantile function associated
with a given interval. Consider the interval $I$ and let $-I$ be the respective symmetric. If
$\Psi^{-1}(t)$ with $t \in [0, 1]$ is the quantile function that represents $I$, the quantile function that
represents $-I$ is $-\Psi^{-1}(1 - t)$ with $t \in [0, 1]$. Figures 2 and 3 illustrate this situation.




Figure 2: Representation of the functions $\Psi^{-1}(t)$, $-\Psi^{-1}(1 - t)$ and $-\Psi^{-1}(t)$.

Figure 3: Representation of the intervals $I = [1, 3]$ and $-I = [-3, -1]$.

It is important to underline that some properties met by the usual symmetric elements are not met when
these elements are ranges of values. The addition of the interval $I$ with $-I$, or of the respective
quantile functions, is not the null interval; because of this, the difference between ranges of values does
not provide information on how dissimilar the intervals are. The difference between two equal intervals
is an interval with symbolic mean zero [5], that is, an interval with center zero and symmetric bounds.
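The behavior just described can be checked numerically with the interval $I = [1, 3]$ of Figure 3. The following is a minimal sketch; the helper names are assumptions introduced for illustration.

```python
def quantile(lower, upper):
    """Quantile function of [lower, upper] under the Uniform assumption."""
    return lambda t: lower + (upper - lower) * t


def symmetric(q):
    """Quantile function of -I, given the quantile function q of I."""
    return lambda t: -q(1 - t)


def bounds(q):
    """Bounds of the interval represented by a non-decreasing quantile function."""
    return (q(0), q(1))


q_I = quantile(1, 3)            # I = [1, 3]
q_neg = symmetric(q_I)          # represents -I = [-3, -1]
naive = lambda t: -q_I(t)       # decreasing, hence not a quantile function

# Function addition of the quantile functions of I and -I.
q_sum = lambda t: q_I(t) + q_neg(t)

print(bounds(q_neg))            # (-3, -1)
print(naive(0), naive(1))       # -1 -3
print(bounds(q_sum))            # (-2, 2)
```

The sum is the interval $[-2, 2]$, with center zero and symmetric bounds, rather than the null interval, which is why a plain difference is not used to measure dissimilarity.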

The reasons given above show why the difference between two ranges of values is not a good way
to measure the dissimilarity between intervals, as it is in classical statistics. In classical linear
regression, the error between the observed values $y_j$ and the predicted values $\hat{y}_j$ is
quantified by the difference between two real numbers, $e_j = y_j - \hat{y}_j$. In this case, the model
that estimates the values $\hat{y}_j$ minimizes $\sum_{j=1}^{m} (y_j - \hat{y}_j)^2$. For intervals, as well
as for histograms, rather than using the difference between these "symbolic values" we will evaluate
the dissimilarity using a distance. As in the case of histogram-valued variables, the Mallows distance
is used [8].

Definition 2.2 Given two quantile functions $\Psi^{-1}_{X(j)}$ and $\Psi^{-1}_{Y(j)}$ that represent the
range of values that the interval-valued variables $X$ and $Y$ take at observation $j$, the square of
the Mallows distance is defined as follows [26]:

$$D_M^2\left(\Psi^{-1}_{X(j)}, \Psi^{-1}_{Y(j)}\right) = \int_0^1 \left(\Psi^{-1}_{X(j)}(t) - \Psi^{-1}_{Y(j)}(t)\right)^2 dt$$

Considering a Uniform distribution across the intervals, Irpino and Verde [25] rewrite the square of the
Mallows distance as follows:

Proposition 2.1 Given two quantile functions $\Psi^{-1}_{X(j)}$ and $\Psi^{-1}_{Y(j)}$ that represent the
interval-valued variables $X$ and $Y$ at each observation $j$, the square of the Mallows distance may
be expressed by [25]:

$$D_M^2\left(\Psi^{-1}_{X(j)}, \Psi^{-1}_{Y(j)}\right) = \left(c_{X(j)} - c_{Y(j)}\right)^2 + \frac{1}{3}\left(r_{X(j)} - r_{Y(j)}\right)^2$$

where $c_{Y(j)}$, $c_{X(j)}$ are the centers and $r_{Y(j)}$, $r_{X(j)}$ are the half-ranges of the intervals
$X(j)$ and $Y(j)$, respectively, with $j \in \{1, 2, \ldots, m\}$.
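To make the link between Definition 2.2 and Proposition 2.1 concrete, the sketch below integrates the squared difference of two linear quantile functions numerically and compares it with the closed form. The use of composite Simpson integration and the function names are choices made for this illustration, not part of the original text; the two intervals are the age intervals of centers A and B in Table 1.

```python
def quantile(center, half_range):
    """Quantile function of an interval under the Uniform assumption."""
    return lambda t: center + half_range * (2 * t - 1)


def mallows_sq_numeric(qx, qy, n=100):
    """Squared Mallows distance by composite Simpson integration on [0, 1].

    n must be even. For linear quantile functions the integrand is quadratic
    in t, so Simpson's rule is exact up to rounding.
    """
    h = 1.0 / n
    f = lambda t: (qx(t) - qy(t)) ** 2
    total = f(0.0) + f(1.0)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(i * h)
    return total * h / 3


def mallows_sq_closed(cx, rx, cy, ry):
    """Closed form of Proposition 2.1."""
    return (cx - cy) ** 2 + (rx - ry) ** 2 / 3


# X = [25, 53] has center 39 and half-range 14; Y = [33, 68] has center 50.5 and half-range 17.5.
numeric = mallows_sq_numeric(quantile(39, 14), quantile(50.5, 17.5))
closed = mallows_sq_closed(39, 14, 50.5, 17.5)
assert abs(numeric - closed) < 1e-9
```

The factor 1/3 is the integral of $(2t - 1)^2$ over $[0, 1]$, while the cross term vanishes because $2t - 1$ integrates to zero.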

It is important to underline that using this distance to measure the similarity between intervals is not new.
It is a particular case of the Bertoluzza distance, used in literature to measure the distance between two 
intervals [24]; in the linear regression models proposed for interval-valued random sets, a generalization 
of this distance is also used [23, 22]. 

2.2 The DSD Model 

The linear regression model proposed by Dias and Brito [ ] for histogram-valued variables uses quantile
functions to represent the distributions that the histogram-valued variables take. For each unit it is
possible to predict response quantile functions from other quantile functions. The parameters of the
model have to be non-negative to ensure that the predicted functions are non-decreasing, which by
itself would force the linear relationship to be always direct. This does not happen because the model
includes not only the quantile functions $\Psi^{-1}_{X_k(j)}(t)$ that represent the distributions that the
explicative histogram-valued variables $X_k$ take for each unit $j$, but also the quantile functions that
represent the respective symmetric histograms. The presence of these two quantile functions associated
to the same unit $j$ allows obtaining a direct or inverse linear relation between
histogram/interval-valued variables even though the coefficients in the model are all non-negative.

The DSD linear regression model for histogram-valued variables proposed by Dias and Brito [ ] may be
particularized to interval-valued variables, as follows.



Definition 2.3 Consider the interval-valued variables $X_1; X_2; \ldots; X_p$. The quantile functions that
represent the range of values that these variables take for each unit $j$ are denoted
$\Psi^{-1}_{X_1(j)}(t), \Psi^{-1}_{X_2(j)}(t), \ldots, \Psi^{-1}_{X_p(j)}(t)$, and the quantile functions that
represent the respective symmetric interval associated to each unit of the referred variables are denoted
$-\Psi^{-1}_{X_1(j)}(1 - t), -\Psi^{-1}_{X_2(j)}(1 - t), \ldots, -\Psi^{-1}_{X_p(j)}(1 - t)$, with $t \in [0, 1]$.
Each quantile function $\Psi^{-1}_{Y(j)}$ may be expressed as follows:

$$\Psi^{-1}_{Y(j)}(t) = \Psi^{-1}_{\hat{Y}(j)}(t) + e_j(t)$$

where $\Psi^{-1}_{\hat{Y}(j)}(t)$ is the predicted quantile function for the unit $j$, obtained from

$$\Psi^{-1}_{\hat{Y}(j)}(t) = \sum_{k=1}^{p} \alpha_k \Psi^{-1}_{X_k(j)}(t) - \sum_{k=1}^{p} \beta_k \Psi^{-1}_{X_k(j)}(1 - t) + \gamma$$

with $t \in [0, 1]$; $\alpha_k, \beta_k \ge 0$, $k \in \{1, 2, \ldots, p\}$ and $\gamma \in \mathbb{R}$.

This linear regression model is named Distribution and Symmetric Distribution (DSD) Regression
Model.



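Particularizing Definition 2.3 to the situation studied in this paper, where we assume uniformity within
the intervals, the predicted quantile function $\Psi^{-1}_{\hat{Y}(j)}$ is defined as follows:

$$\Psi^{-1}_{\hat{Y}(j)}(t) = \sum_{k=1}^{p} \alpha_k \left(c_{X_k(j)} + r_{X_k(j)}(2t - 1)\right) + \sum_{k=1}^{p} \beta_k \left(-c_{X_k(j)} + r_{X_k(j)}(2t - 1)\right) + \gamma \tag{1}$$

with $t \in [0, 1]$; $\alpha_k, \beta_k \ge 0$, $k \in \{1, 2, \ldots, p\}$ and $\gamma \in \mathbb{R}$.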

For each unit $j$, the predicted interval $I_{\hat{Y}(j)}$ may be obtained from

$$I_{\hat{Y}(j)} = \left[\sum_{k=1}^{p} \left(\alpha_k \underline{I}_{X_k(j)} - \beta_k \overline{I}_{X_k(j)}\right) + \gamma, \;\; \sum_{k=1}^{p} \left(\alpha_k \overline{I}_{X_k(j)} - \beta_k \underline{I}_{X_k(j)}\right) + \gamma\right] \tag{2}$$

The error, for each unit $j$, is a function, but not necessarily a quantile function, given by
$e_j(t) = \Psi^{-1}_{Y(j)}(t) - \Psi^{-1}_{\hat{Y}(j)}(t)$.

By including in the model both the distribution of the explicative interval-valued variables and the
respective symmetric, the linear relationship between the intervals is not necessarily direct, even though
non-negativity restrictions are imposed on the parameters. According to the DSD Model, the center
$c_{\hat{Y}(j)}$ and half-range $r_{\hat{Y}(j)}$ (or the bounds) of the predicted interval-valued variable
may be described by a classical linear regression on the centers $c_{X_k(j)}$ and half-ranges
$r_{X_k(j)}$ (or the bounds) of the explicative interval-valued variables. These linear regressions are
as follows:

$$c_{\hat{Y}(j)} = \sum_{k=1}^{p} \left(\alpha_k - \beta_k\right) c_{X_k(j)} + \gamma \tag{3}$$

$$r_{\hat{Y}(j)} = \sum_{k=1}^{p} \left(\alpha_k + \beta_k\right) r_{X_k(j)} \tag{4}$$

with $\alpha_k, \beta_k \ge 0$ and $\gamma \in \mathbb{R}$.

From Equations (3) and (4) we may observe that the parameters that define the linear regressions
between the centers and between the half ranges of the intervals are not the same but are related.
Although this model is defined between intervals and the relationship between the intervals may be
direct or inverse, it always induces a direct linear relationship between the half ranges of the intervals.
The direct or inverse relationship between the interval-valued variables is always in accordance with
the linear relationship between the centers: the interval-valued variable $X_k$ is in a direct linear
relationship with $Y$ when $\alpha_k > \beta_k$, and the relationship is inverse if $\alpha_k < \beta_k$.
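The agreement between the bounds in Equation (2) and the center and half-range regressions in Equations (3) and (4) can be illustrated with a short sketch. The intervals and parameter values below are hypothetical, and the function names are assumptions made for this example; the first explicative variable has $\alpha_1 < \beta_1$ (inverse relationship) and the second has $\alpha_2 > \beta_2$ (direct relationship).

```python
def predict_interval(x_intervals, alpha, beta, gamma):
    """Predicted interval from Equation (2); x_intervals holds (lower, upper) pairs."""
    lower = gamma + sum(a * lo - b * hi for (lo, hi), a, b in zip(x_intervals, alpha, beta))
    upper = gamma + sum(a * hi - b * lo for (lo, hi), a, b in zip(x_intervals, alpha, beta))
    return (lower, upper)


def predict_center_half_range(x_intervals, alpha, beta, gamma):
    """Predicted center and half-range from Equations (3) and (4)."""
    center = gamma + sum((a - b) * (lo + hi) / 2
                         for (lo, hi), a, b in zip(x_intervals, alpha, beta))
    half_range = sum((a + b) * (hi - lo) / 2
                     for (lo, hi), a, b in zip(x_intervals, alpha, beta))
    return (center, half_range)


# Hypothetical observation of two explicative variables and hypothetical parameters.
x = [(1, 3), (10, 14)]
alpha = [0.5, 0.25]
beta = [1.5, 0.0]
gamma = 2.0

lower, upper = predict_interval(x, alpha, beta, gamma)
center, half_range = predict_center_half_range(x, alpha, beta, gamma)

print((lower, upper))          # (0.5, 5.5)
print((center, half_range))    # (3.0, 2.5)
assert (center - half_range, center + half_range) == (lower, upper)
```

Because every coefficient $\alpha_k + \beta_k$ multiplying a half-range is non-negative, the predicted half-range cannot be negative, so the predicted bounds always form an interval.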

The non-negative parameters of the DSD model, in Definition 2.3, are determined by solving a quadratic
optimization problem, subject to non-negativity constraints on the unknowns. The distance used to
quantify the dissimilarity between the predicted and the observed quantile functions is the Mallows
distance [26].

Consider the centers $c_{Y(j)}$ and half-ranges $r_{Y(j)}$ of the observed intervals $I_{Y(j)}$ and the
predicted intervals $I_{\hat{Y}(j)}$ defined in Equation (2). The quadratic optimization problem that
must be solved to obtain the parameters of the model is then:

$$\min \sum_{j=1}^{m} \left[\left(c_{Y(j)} - \sum_{k=1}^{p} (\alpha_k - \beta_k) c_{X_k(j)} - \gamma\right)^2 + \frac{1}{3}\left(r_{Y(j)} - \sum_{k=1}^{p} (\alpha_k + \beta_k) r_{X_k(j)}\right)^2\right] \tag{5}$$

s.t. $\alpha_k, \beta_k \ge 0$, $k \in \{1, 2, \ldots, p\}$, $\gamma \in \mathbb{R}$.



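The optimization problem in (5) may also be rewritten in matrix form as a classical constrained
quadratic optimization problem (see [8]). However, this problem may also be defined as a constrained
least squares problem.

Consider the vectors of length $m$ of the observed centers and half ranges of the response variable $Y$:

$$\mathbf{y}^c = \left(c_{Y(1)}, \ldots, c_{Y(m)}\right) \quad \text{and} \quad \mathbf{y}^r = \left(r_{Y(1)}, \ldots, r_{Y(m)}\right);$$

and the vector of length $2p + 1$ of the parameters of the model:

$$\mathbf{b} = \left(\alpha_1, \beta_1, \ldots, \alpha_p, \beta_p, \gamma\right)$$

From the vectors $\mathbf{x}^c_j$ and $\mathbf{x}^r_j$ defined by

$$\mathbf{x}^c_j = \left(c_{X_1(j)}, -c_{X_1(j)}, \ldots, c_{X_p(j)}, -c_{X_p(j)}, 1\right) \quad \text{and} \quad \mathbf{x}^r_j = \left(r_{X_1(j)}, r_{X_1(j)}, \ldots, r_{X_p(j)}, r_{X_p(j)}, 0\right)$$

we can build the matrices of order $(2p + 1) \times m$:

$$\mathbf{X}^c = \left[\mathbf{x}^c_1 \; \mathbf{x}^c_2 \; \ldots \; \mathbf{x}^c_m\right] \quad \text{and} \quad \mathbf{X}^r = \left[\mathbf{x}^r_1 \; \mathbf{x}^r_2 \; \ldots \; \mathbf{x}^r_m\right]$$

With these matrices, the minimization problem in (5) may be rewritten in matrix form as follows:

$$\min \; \left\|\mathbf{y}^c - (\mathbf{X}^c)^T \mathbf{b}\right\|^2 + \frac{1}{3}\left\|\mathbf{y}^r - (\mathbf{X}^r)^T \mathbf{b}\right\|^2 \tag{6}$$

s.t. $\alpha_k, \beta_k \ge 0$, $k \in \{1, 2, \ldots, p\}$, $\gamma \in \mathbb{R}$.

But, as the parameters for the centers and half ranges may not be obtained independently, we may
rewrite the optimization problem (6) as a least squares problem:

$$\min \; \left\| \begin{bmatrix} \mathbf{y}^c \\ \frac{1}{\sqrt{3}} \mathbf{y}^r \end{bmatrix} - \begin{bmatrix} (\mathbf{X}^c)^T \\ \frac{1}{\sqrt{3}} (\mathbf{X}^r)^T \end{bmatrix} \mathbf{b} \right\|^2 = \left\|\mathbf{Y} - \mathbf{X}\mathbf{b}\right\|^2 \tag{7}$$

s.t. $\alpha_k, \beta_k \ge 0$, $k \in \{1, 2, \ldots, p\}$, $\gamma \in \mathbb{R}$.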



Several methods may be found in the literature to solve the constrained least squares problem (7) and,
therefore, the constrained quadratic optimization problem (5).

As both the quadratic function to optimize and the feasible region are convex, the vectors that verify
the Kuhn-Tucker conditions are guaranteed to be the vectors where the function reaches its global
minimum, i.e., the optimal solutions. When the objective function is strictly convex, the optimal
solution is unique.
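The equivalence between the objective of problem (5) and the stacked least squares form (7) can be checked with a short sketch. The data and parameter vector below are hypothetical, and the helper names are assumptions made for illustration; the minimization itself is not performed here and would be delegated to any least squares solver that supports non-negativity bounds on part of the unknowns.

```python
import math


def center_half_range(interval):
    lo, hi = interval
    return ((lo + hi) / 2, (hi - lo) / 2)


def objective_5(y, xs, b):
    """Objective of problem (5).

    y: list of m intervals; xs: list of p lists of m intervals;
    b = [alpha_1, beta_1, ..., alpha_p, beta_p, gamma].
    """
    total = 0.0
    for j, y_j in enumerate(y):
        cy, ry = center_half_range(y_j)
        c_hat, r_hat = b[-1], 0.0
        for k, x_k in enumerate(xs):
            cx, rx = center_half_range(x_k[j])
            c_hat += (b[2 * k] - b[2 * k + 1]) * cx
            r_hat += (b[2 * k] + b[2 * k + 1]) * rx
        total += (cy - c_hat) ** 2 + (ry - r_hat) ** 2 / 3
    return total


def stacked_system(y, xs):
    """Response vector Y (length 2m) and matrix X (2m rows, 2p + 1 columns) of problem (7)."""
    s = 1 / math.sqrt(3)
    center_rows, range_rows, y_c, y_r = [], [], [], []
    for j, y_j in enumerate(y):
        cy, ry = center_half_range(y_j)
        y_c.append(cy)
        y_r.append(s * ry)
        row_c, row_r = [], []
        for x_k in xs:
            cx, rx = center_half_range(x_k[j])
            row_c += [cx, -cx]
            row_r += [s * rx, s * rx]
        center_rows.append(row_c + [1.0])
        range_rows.append(row_r + [0.0])
    return y_c + y_r, center_rows + range_rows


def objective_7(Y, X, b):
    """Squared Euclidean norm of Y - Xb."""
    return sum((y_i - sum(x_ij * b_j for x_ij, b_j in zip(row, b))) ** 2
               for y_i, row in zip(Y, X))


# Hypothetical data: m = 3 units, p = 2 explicative variables.
y = [(2, 6), (3, 9), (5, 8)]
xs = [[(1, 3), (2, 5), (4, 6)], [(0, 2), (1, 1), (3, 7)]]
b = [0.7, 0.2, 0.1, 0.4, 1.0]

Y, X = stacked_system(y, xs)
assert abs(objective_5(y, xs, b) - objective_7(Y, X, b)) < 1e-9
```

Scaling the half-range rows by $1/\sqrt{3}$ is what turns the weighted sum in (5) into a single Euclidean norm, so the centers and half ranges are fitted jointly with shared parameters.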

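Let $(\alpha_1^*, \beta_1^*, \ldots, \alpha_p^*, \beta_p^*, \gamma^*)$ be an optimal solution of the
optimization problem in (5). Dias and Brito [8] proved that the mean of the predicted histogram-valued
variable $\hat{Y}$ is given by:

$$\overline{\hat{Y}} = \sum_{k=1}^{p} \left(\alpha_k^* - \beta_k^*\right) \overline{X}_k + \gamma^* \tag{8}$$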



As the quantile function $\Psi^{-1}_{X_k(j)}(t)$ and the respective symmetric $-\Psi^{-1}_{X_k(j)}(1 - t)$,
with $t \in [0, 1]$, are both in the DSD Model, it is important to analyze the behavior of the model in
the situations where these functions are collinear. Proposition 2.2 below allows deducing the
collinearity conditions.

Proposition 2.2 The quantile functions $\Psi^{-1}_{X_k(j)}(t) = c_{X_k(j)} + r_{X_k(j)}(2t - 1)$ and
$-\Psi^{-1}_{X_k(j)}(1 - t) = -c_{X_k(j)} + r_{X_k(j)}(2t - 1)$ with $0 \le t \le 1$, that represent the
intervals $I_{X_k(j)}$ and $-I_{X_k(j)}$, respectively, are collinear if the interval $I_{X_k(j)}$ has
$c_{X_k(j)} = 0$, which means that the interval is symmetric, or $r_{X_k(j)} = 0$, which means that the
interval is reduced to a real number (degenerate interval).

Proof: The quantile functions $\Psi^{-1}_{X_k(j)}(t)$ and $-\Psi^{-1}_{X_k(j)}(1 - t)$ with $0 \le t \le 1$
are collinear if there exists a real number $\lambda \ne 0$ such that
$-\Psi^{-1}_{X_k(j)}(1 - t) = \lambda \Psi^{-1}_{X_k(j)}(t)$, with $t \in [0, 1]$.

$$-\Psi^{-1}_{X_k(j)}(1 - t) = \lambda \Psi^{-1}_{X_k(j)}(t) \Leftrightarrow -c_{X_k(j)} + r_{X_k(j)}(2t - 1) = \lambda \left(c_{X_k(j)} + r_{X_k(j)}(2t - 1)\right)$$
$$\Rightarrow \left(c_{X_k(j)} = 0 \wedge \lambda = 1 \wedge r_{X_k(j)} \in \mathbb{R}\right) \vee \left(r_{X_k(j)} = 0 \wedge \lambda = -1 \wedge c_{X_k(j)} \in \mathbb{R}\right)$$

Therefore the two quantile functions are collinear when the interval $I_{X_k(j)}$ is symmetric, that is,
$I_{X_k(j)} = \left[-r_{X_k(j)}, r_{X_k(j)}\right]$, or degenerate, i.e., $I_{X_k(j)} = c_{X_k(j)}$. □
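Proposition 2.2 can be read as a statement about two coefficient vectors. In the basis $\{1, 2t - 1\}$, the quantile function of $I$ has coefficients $(c, r)$ and that of $-I$ has coefficients $(-c, r)$, so the two functions are linearly dependent exactly when the determinant $c \cdot r - (-c) \cdot r = 2cr$ vanishes. The sketch below uses this reformulation; the function names and tolerance are assumptions made for illustration.

```python
def coefficient_pair(lower, upper):
    """Coefficients (center, half-range) of the quantile functions of I and -I
    in the basis {1, 2t - 1}."""
    c, r = (lower + upper) / 2, (upper - lower) / 2
    return (c, r), (-c, r)


def are_collinear(lower, upper, tol=1e-12):
    """True when the quantile functions of I and -I are linearly dependent."""
    (c1, r1), (c2, r2) = coefficient_pair(lower, upper)
    # Two vectors in the plane are linearly dependent iff their determinant is zero;
    # here the determinant equals 2 * c * r.
    return abs(c1 * r2 - c2 * r1) <= tol


print(are_collinear(-2, 2))   # True: symmetric interval, center 0
print(are_collinear(5, 5))    # True: degenerate interval, half-range 0
print(are_collinear(1, 3))    # False: center 2 and half-range 1
```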

The DSD Model can nevertheless be applied when the quantile function $\Psi^{-1}_{X_k(j)}(t)$ and the
respective symmetric are collinear. However, Equation (1) reduces to the classical linear regression
model between the centers, in Equation (3), when all intervals of the explicative interval-valued
variables are degenerate, and to the classical model between the half ranges, in Equation (4), when all
intervals of the explicative interval-valued variables are symmetric.

When the interval-valued variable and the respective symmetric are collinear, the optimization
problem has an optimal solution but it is not unique because, in this situation, the quadratic function to
optimize is not strictly convex (the columns of $\mathbf{X}$ in Equation (7) are linearly dependent).
However, all parameter values at which the global minimum is attained yield the same model, which
in these cases is a classical model between the centers or between the half ranges.

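As the DSD model for interval-valued variables is a particular case of the model defined by Dias and
Brito [8] for histogram-valued variables, the optimal solution of the quadratic optimization problem for
interval-valued variables with non-negativity constraints verifies the Kuhn-Tucker conditions. It is
therefore possible to prove the following decomposition [ ]:

$$\sum_{j=1}^{m} D_M^2\left(\Psi^{-1}_{Y(j)}, \overline{Y}\right) = \sum_{j=1}^{m} D_M^2\left(\Psi^{-1}_{\hat{Y}(j)}, \overline{Y}\right) + \sum_{j=1}^{m} D_M^2\left(\Psi^{-1}_{Y(j)}, \Psi^{-1}_{\hat{Y}(j)}\right)$$

This decomposition allows defining the goodness-of-fit measure for the proposed model for
interval-valued variables.

Definition 2.4 Consider the observed and predicted ranges of values of the interval-valued variables
$Y$ and $\hat{Y}$ represented, respectively, by their quantile functions $\Psi^{-1}_{Y(j)}$ and
$\Psi^{-1}_{\hat{Y}(j)}$. Consider also the symbolic mean of the interval-valued variable $Y$, given by
$\overline{Y} = \frac{1}{m} \sum_{j=1}^{m} c_{Y(j)}$ [ ]. The goodness-of-fit measure is given by

$$\Omega = \frac{\sum_{j=1}^{m} D_M^2\left(\Psi^{-1}_{\hat{Y}(j)}(t), \overline{Y}\right)}{\sum_{j=1}^{m} D_M^2\left(\Psi^{-1}_{Y(j)}(t), \overline{Y}\right)} = \frac{\sum_{j=1}^{m} \left[\left(c_{\hat{Y}(j)} - \overline{Y}\right)^2 + \frac{1}{3} r_{\hat{Y}(j)}^2\right]}{\sum_{j=1}^{m} \left[\left(c_{Y(j)} - \overline{Y}\right)^2 + \frac{1}{3} r_{Y(j)}^2\right]}$$

As in classical linear regression, where the coefficient of determination $R^2$ ranges from 0 to 1, the
goodness-of-fit measure, $\Omega$, also ranges between 0 and 1.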
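A minimal sketch of how the goodness-of-fit measure could be computed from lists of observed and predicted intervals is given below. The function name and the data are hypothetical. Note that the range between 0 and 1 relies on the decomposition above, which holds for predictions obtained from an optimal solution of the model; for arbitrary predicted intervals the ratio is not guaranteed to stay below 1.

```python
def omega(observed, predicted):
    """Goodness-of-fit measure of Definition 2.4 for lists of intervals (lower, upper)."""
    m = len(observed)
    cr_obs = [((lo + hi) / 2, (hi - lo) / 2) for lo, hi in observed]
    cr_hat = [((lo + hi) / 2, (hi - lo) / 2) for lo, hi in predicted]
    y_bar = sum(c for c, _ in cr_obs) / m   # symbolic mean of Y
    # Squared Mallows distances to the degenerate interval [y_bar, y_bar],
    # whose center is y_bar and whose half-range is 0.
    explained = sum((c - y_bar) ** 2 + r ** 2 / 3 for c, r in cr_hat)
    total = sum((c - y_bar) ** 2 + r ** 2 / 3 for c, r in cr_obs)
    return explained / total


# Hypothetical observed intervals; their centers are 4, 6 and 6.5, so the symbolic mean is 5.5.
observed = [(2, 6), (3, 9), (5, 8)]

perfect_fit = omega(observed, observed)
constant_fit = omega(observed, [(5.5, 5.5)] * 3)
print(perfect_fit, constant_fit)  # 1.0 0.0
```

A perfect prediction gives 1, while predicting the symbolic mean as a degenerate interval for every unit gives 0, mirroring the behavior of $R^2$ in the classical case.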

2.3 The DSD Model is a generalization of the classical linear regression model 

Symbolic variables, introduced in Symbolic Data Analysis, are a generalization of classical variables, and 
the statistical concepts and methods defined for these variables should also generalize the classical ones. 
As we will see below, the DSD linear regression model defined for histogram-valued variables [8] and its 
present particularization for interval-valued variables, may be written for classical variables since their 
values are degenerate intervals (the upper and lower bounds are identical). 

Proposition 2.3 The expression that allows predicting the values that the response variable takes in a 
classical linear regression model is a particular case of the one obtained by the DSD linear regression 
model for interval-valued variables given in (1), if we consider intervals where the upper and lower 
bounds of the intervals are the same. 

Proof: Consider the observations of the explicative classical variables $X_k$, with
$k \in \{1, 2, \ldots, p\}$, and the observations of the response classical variable $Y$. For each unit $j$,
the observed values of the variables $X_k$ are real numbers $b_{X_k(j)}$ that may be represented by the
interval $\left[b_{X_k(j)}, b_{X_k(j)}\right]$ or by the quantile function
$\Psi^{-1}_{X_k(j)}(t) = b_{X_k(j)}$ (which in this case is a constant function). For each unit $j$, the
predicted value of the classical variable $Y$ is the real number $\hat{y}(j)$, which may similarly be
represented by an interval or a quantile function.

Equation (1) allows predicting the values of the variable $Y$, as follows:

$$\hat{y}(j) = \sum_{k=1}^{p} \left(\alpha_k - \beta_k\right) b_{X_k(j)} + \gamma$$

with $\alpha_k, \beta_k \ge 0$, $k \in \{1, 2, \ldots, p\}$ and $\gamma \in \mathbb{R}$.

As $\alpha_k, \beta_k \ge 0$, $\alpha_k - \beta_k$ is a real number. If we consider
$\lambda_k = \alpha_k - \beta_k$, we have the classical linear regression model

$$\hat{y}(j) = \gamma + \sum_{k=1}^{p} \lambda_k b_{X_k(j)}$$

with $\lambda_k \in \mathbb{R}$ and $k \in \{1, 2, \ldots, p\}$. □

As we have referred before, in a situation of degenerate intervals the function to optimize is not strictly
convex, and therefore more than one optimal solution exists. However, for all optimal parameters
$\alpha_k$ and $\beta_k$ we obtain the same parameter $\lambda_k$. Since no constraint is imposed on
this parameter, we have in this case a classical linear regression model.
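The non-uniqueness mentioned above can be seen in a few lines. The sketch below evaluates the DSD prediction for degenerate intervals with two different, hypothetical choices of $(\alpha_k, \beta_k)$ that share the same differences $\lambda_k = \alpha_k - \beta_k$; the function names and values are assumptions made for illustration.

```python
def dsd_predict_degenerate(x_values, alpha, beta, gamma):
    """DSD prediction when every explicative interval is degenerate, [b, b]."""
    return gamma + sum((a - b) * x for x, a, b in zip(x_values, alpha, beta))


def classical_predict(x_values, lam, gamma):
    """Classical linear regression prediction with unconstrained slopes."""
    return gamma + sum(l * x for x, l in zip(x_values, lam))


x_values = [2.0, 4.0]   # hypothetical classical observations
gamma = 0.5

# Two different non-negative parametrizations with the same lambda = (1.0, -2.0).
pred_1 = dsd_predict_degenerate(x_values, [1.5, 0.0], [0.5, 2.0], gamma)
pred_2 = dsd_predict_degenerate(x_values, [3.0, 1.0], [2.0, 3.0], gamma)
pred_classical = classical_predict(x_values, [1.0, -2.0], gamma)

print(pred_1, pred_2, pred_classical)  # -5.5 -5.5 -5.5
```

Only the differences $\alpha_k - \beta_k$ matter for degenerate intervals, and since a difference of two non-negative numbers can be any real number, the slopes of the resulting classical model are unconstrained.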

Also, the goodness-of-fit measure for interval-valued variables is a generalization of the coefficient of 
determination R^ of classical variables. To obtain this result, it is first necessary to prove that follow 
proposition. 

Proposition 2.4 The Mallows distance between intervals reduced to real numbers is the Euclidean dis- 
tance between two real numbers. 



Proof: Consider two intervals Ix and ly with equal bounds, Ix = [&i,&i] and ly — [62,^2] with 
61 , 62 G IK; those intervals may be represented by the quantile functions '^^ (t) = 61 and 'f^^ (t) = &2 
forO << < 1. 



The Mallows distance in Definition 2.2 applied to these particular intervals Ix and ly whose centers are 
61 and 62, respectively, and both have range 0, is: 

So, we obtain the squared Euclidean distance between two unidimensional points. D 

To conclude the previous result, we just need to state the following straightforward proposition: 

Proposition 2.5 The goodness-of-fit measure in Definition 2.4 particularized to degenerate intervals is 
the coefficient of determination $R^2$ of the classical linear regression model. 

Therefore, it may be said that the DSD Model under uniformity is a theoretical generalization of the 
classical linear regression model. 

2.4 The single DSD Regression Model for interval-valued variables 

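Using Definition 2.3 for the special case of only one explicative variable, the predicted quantile function 
$\Psi^{-1}_{\hat{Y}(j)}$ for each unit of the predicted interval-valued variable is given by 

$$\Psi^{-1}_{\hat{Y}(j)}(t) = \gamma + (\alpha - \beta)\, c_{X(j)} + (\alpha + \beta)\, r_{X(j)} (2t - 1), \quad 0 \leq t \leq 1. \qquad (9)$$

The corresponding predicted interval $I_{\hat{Y}(j)}$ is the following: 

$$I_{\hat{Y}(j)} = \left[ \alpha \underline{I}_{X(j)} - \beta \overline{I}_{X(j)} + \gamma, \; \alpha \overline{I}_{X(j)} - \beta \underline{I}_{X(j)} + \gamma \right] \qquad (10)$$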



Proposition 2.6 Consider the interval-valued variable Y predicted by the DSD Model from the interval- 
valued variable X. From this relationship we may conclude: 
1. The centers of the predicted intervals are in a classical linear relation with the centers of the 
observed intervals of the variable X. 

2. For each unit j, the ratio between the half ranges of the intervals $\hat{Y}(j)$ and $X(j)$ is constant. 

Proof: 

From the DSD Model we obtain the relationship between the centers and half ranges of the interval-valued 
variables in Equations (3) and (4). Particularizing these equations to one explicative variable, we obtain 
for each unit j, with $j \in \{1, 2, \ldots, m\}$, the following: 

$$c_{\hat{Y}(j)} = (\alpha - \beta)\, c_{X(j)} + \gamma$$

and 

$$r_{\hat{Y}(j)} = (\alpha + \beta)\, r_{X(j)} \Leftrightarrow \frac{r_{\hat{Y}(j)}}{r_{X(j)}} = \alpha + \beta.$$

So, when two interval-valued variables are in a perfect linear relationship, the centers of the intervals are in 
a perfect classical linear relationship and the ratio of the ranges of the intervals is constant and equal for 
all units. □ 
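
The two statements of Proposition 2.6 can be illustrated with a small sketch of Equation (10). The function names and the example values are hypothetical; the code only assumes the single-variable DSD Model with $\alpha, \beta \geq 0$.

```python
def dsd_predict_interval(x_low, x_up, alpha, beta, gamma):
    """Predicted interval of Equation (10) for one unit (hypothetical helper)."""
    if alpha < 0 or beta < 0:
        raise ValueError("alpha and beta must be non-negative")
    return alpha * x_low - beta * x_up + gamma, alpha * x_up - beta * x_low + gamma

def center_half_range(low, up):
    """Center and half range of the interval [low, up]."""
    return (low + up) / 2, (up - low) / 2

alpha, beta, gamma = 2.0, 1.0, -1.0           # illustrative parameter values
for x_low, x_up in [(1.0, 5.0), (0.0, 1.0), (-3.0, 7.0)]:
    c_x, r_x = center_half_range(x_low, x_up)
    c_y, r_y = center_half_range(*dsd_predict_interval(x_low, x_up, alpha, beta, gamma))
    # The center follows (alpha - beta) * c_x + gamma; the half-range ratio is alpha + beta.
    print(c_y, (alpha - beta) * c_x + gamma, r_y / r_x)
```

In every unit the printed ratio equals $\alpha + \beta$, whatever the width of the observed interval.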

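In this situation, when we predict one interval-valued variable from only one interval-valued variable, 
it is straightforward to obtain the expressions of the parameters $\alpha$, $\beta$ and $\gamma$ of the DSD Model. To find 
these expressions it is necessary to solve the quadratic optimization problem with non-negativity constraints 
on the parameters $\alpha$ and $\beta$, as described in (5), but now considering only one explicative variable. The 
minimization problem is in this case as follows: 

$$\min f(\alpha, \beta, \gamma) = \sum_{j=1}^{m} \left[ \left( c_{Y(j)} - (\alpha - \beta)\, c_{X(j)} - \gamma \right)^2 + \frac{1}{3} \left( r_{Y(j)} - (\alpha + \beta)\, r_{X(j)} \right)^2 \right] \qquad (11)$$

$$\text{s.t.} \quad g_1(\alpha, \beta, \gamma) = -\alpha \leq 0, \quad g_2(\alpha, \beta, \gamma) = -\beta \leq 0, \quad \gamma \in \mathbb{R}.$$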

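Proposition 2.7 Consider the minimization problem in Equation (11). When the function to minimize is 
strictly convex, the optimal solution of this problem, i.e., the values estimated for the parameters of the 
DSD Model when the objective function reaches the minimum value, are given by: 

• If $\sum_{j=1}^{m} r_{X(j)} r_{Y(j)} \sum_{j=1}^{m} (c_{X(j)} - \overline{X})^2 \geq \left| \sum_{j=1}^{m} (c_{Y(j)} - \overline{Y})\, c_{X(j)} \right| \sum_{j=1}^{m} r_{X(j)}^2$. In this case, 

$$\alpha^* = \frac{1}{2} \left[ \frac{\sum_{j=1}^{m} r_{X(j)} r_{Y(j)}}{\sum_{j=1}^{m} r_{X(j)}^2} + \frac{\sum_{j=1}^{m} (c_{Y(j)} - \overline{Y})(c_{X(j)} - \overline{X})}{\sum_{j=1}^{m} (c_{X(j)} - \overline{X})^2} \right]; \quad \beta^* = \frac{1}{2} \left[ \frac{\sum_{j=1}^{m} r_{X(j)} r_{Y(j)}}{\sum_{j=1}^{m} r_{X(j)}^2} - \frac{\sum_{j=1}^{m} (c_{Y(j)} - \overline{Y})(c_{X(j)} - \overline{X})}{\sum_{j=1}^{m} (c_{X(j)} - \overline{X})^2} \right];$$

$$\gamma^* = \overline{Y} - (\alpha^* - \beta^*)\, \overline{X}.$$

• If $\sum_{j=1}^{m} r_{X(j)} r_{Y(j)} \sum_{j=1}^{m} (c_{X(j)} - \overline{X})^2 < \left| \sum_{j=1}^{m} (c_{Y(j)} - \overline{Y})\, c_{X(j)} \right| \sum_{j=1}^{m} r_{X(j)}^2$ then $\alpha^* = 0 \vee \beta^* = 0$. In this case, 

- If $\sum_{j=1}^{m} \frac{1}{3} r_{X(j)} r_{Y(j)} > \sum_{j=1}^{m} (c_{Y(j)} - \overline{Y})\, c_{X(j)}$ then 

$$\alpha^* = 0; \quad \beta^* = \frac{\sum_{j=1}^{m} \frac{1}{3} r_{X(j)} r_{Y(j)} - \sum_{j=1}^{m} (c_{Y(j)} - \overline{Y})(c_{X(j)} - \overline{X})}{\sum_{j=1}^{m} (c_{X(j)} - \overline{X})^2 + \sum_{j=1}^{m} \frac{1}{3} r_{X(j)}^2}; \quad \gamma^* = \overline{Y} + \beta^* \overline{X}.$$

- If $\sum_{j=1}^{m} \frac{1}{3} r_{X(j)} r_{Y(j)} < \sum_{j=1}^{m} (c_{Y(j)} - \overline{Y})\, c_{X(j)}$ then 

$$\alpha^* = \frac{\sum_{j=1}^{m} \frac{1}{3} r_{X(j)} r_{Y(j)} + \sum_{j=1}^{m} (c_{Y(j)} - \overline{Y})(c_{X(j)} - \overline{X})}{\sum_{j=1}^{m} (c_{X(j)} - \overline{X})^2 + \sum_{j=1}^{m} \frac{1}{3} r_{X(j)}^2}; \quad \beta^* = 0; \quad \gamma^* = \overline{Y} - \alpha^* \overline{X}.$$

- If $\sum_{j=1}^{m} \frac{1}{3} r_{X(j)} r_{Y(j)} = \sum_{j=1}^{m} (c_{Y(j)} - \overline{Y})\, c_{X(j)}$ then $\alpha^* = 0$; $\beta^* = 0$ and $\gamma^* = \overline{Y}$. 

Proof: The proof is given in Appendix A. □ 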
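
As a complement to Proposition 2.7, the following sketch estimates the parameters numerically. It does not implement the case analysis of the proposition: it assumes that the objective of Equation (11) is $\sum_j [(c_{Y(j)} - (\alpha - \beta) c_{X(j)} - \gamma)^2 + \frac{1}{3}(r_{Y(j)} - (\alpha + \beta) r_{X(j)})^2]$, that it is strictly convex (the centers of X are not all equal and not all half ranges of X are zero), and it simply compares the stationary points of the interior and of each boundary ($\alpha = 0$, $\beta = 0$, or both). The function name `dsd_fit_single` is hypothetical.

```python
import numpy as np

def dsd_fit_single(cx, rx, cy, ry):
    """Sketch: minimize the assumed objective of Equation (11) subject to
    alpha >= 0 and beta >= 0 by comparing the candidate solutions of each face."""
    cx, rx, cy, ry = (np.asarray(v, dtype=float) for v in (cx, rx, cy, ry))
    xbar, ybar = cx.mean(), cy.mean()
    sxx = np.sum((cx - xbar) ** 2)
    sxy = np.sum((cy - ybar) * (cx - xbar))
    srr = np.sum(rx * ry) / 3.0
    sr2 = np.sum(rx ** 2) / 3.0

    def objective(a, b):
        g = ybar - (a - b) * xbar                 # optimal gamma for fixed (a, b)
        return (np.sum((cy - (a - b) * cx - g) ** 2)
                + np.sum((ry - (a + b) * rx) ** 2) / 3.0)

    denom = sxx + sr2
    candidates = [
        (0.0, 0.0),                               # both constraints active
        (max(0.0, (srr + sxy) / denom), 0.0),     # beta = 0
        (0.0, max(0.0, (srr - sxy) / denom)),     # alpha = 0
    ]
    lam, mu = sxy / sxx, srr / sr2                # unconstrained alpha-beta, alpha+beta
    a, b = (mu + lam) / 2, (mu - lam) / 2
    if a >= 0 and b >= 0:                         # interior solution is feasible
        candidates.append((a, b))
    a, b = min(candidates, key=lambda ab: objective(*ab))
    return a, b, ybar - (a - b) * xbar

# Data generated exactly by alpha = 2, beta = 1, gamma = -1 are recovered.
cx = np.array([1.0, 2.0, 4.0, 7.0]); rx = np.array([1.0, 0.5, 2.0, 1.5])
print(dsd_fit_single(cx, rx, (2 - 1) * cx - 1, (2 + 1) * rx))
```

Enumerating the faces is practical here only because there are two constrained parameters; for p explicative variables a general quadratic programming solver would be the natural replacement.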
3 Simulation studies 

The simulation studies that we will now present have two main goals. The first study aims at identifying 
the error function characteristics that are needed to disturb the linear regression in different given ways. 
In the second study, we want to evaluate empirically the behavior of the parameter estimation of the DSD 
Model applied to interval-valued variables, when the explicative and response variables present different 
levels of linearity. 

3.1 Building symbolic simulated data tables 

To build the symbolic simulated data tables it is necessary to generate the observations of the 
interval-valued variables $X_k$, $k \in \{1, \ldots, p\}$, and Y, where Y is the variable to be modeled from $X_k$ by the 
DSD Model. The process to obtain these data tables is similar to the one used in the simulation study for 
histogram-valued variables in Dias and Brito [ ]. To obtain the m observations associated to an 
interval-valued variable $X_k$, we start by uniformly simulating 5000 real values corresponding to each unit. For 
each unit, we select the minimum and maximum of these values and build the interval associated 
to that unit. For the explicative variables $X_k$, we consider three levels of variability: 

• Low variability - when the intervals associated to the variable Xk have similar small half ranges; 

• High variability - when the intervals associated to the variable Xk have similar large half ranges; 

• Mixed variability - when we have a mixture of intervals associated to the variable Xk with variable 
half ranges. 
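
The construction of the observed intervals can be sketched as follows. This is only an illustration of the procedure described above: the function name and the seed are hypothetical, and the bounds used in the example are those of the low-variability setting of Section 3.2.

```python
import numpy as np

def simulate_intervals(m, delta_low, delta_up, n_micro=5000, seed=0):
    """For each of the m units, draw n_micro Uniform microdata values and keep
    their minimum and maximum as the observed interval (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    intervals = np.empty((m, 2))
    for j in range(m):
        low = rng.uniform(*delta_low)      # lower limit of the microdata for unit j
        up = rng.uniform(*delta_up)        # upper limit of the microdata for unit j
        micro = rng.uniform(low, up, size=n_micro)
        intervals[j] = micro.min(), micro.max()
    return intervals

# Low variability: lower limits in U(-2, 0), upper limits in U(4, 6).
X = simulate_intervals(10, (-2, 0), (4, 6))
print(X[:3])
```

With 5000 microdata values per unit, the observed bounds are very close to the limits of the underlying Uniform distribution, so the half ranges are controlled by the choice of those limits.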

Afterwards, the intervals that are the observations of the interval-valued variable Y are obtained 
considering the DSD Model for particular values of the parameters and the error function $e_j(t)$. So, the values 
of the interval-valued variable Y, for each unit j, are obtained by 

$$\Psi^{-1}_{Y(j)}(t) = \gamma + \sum_{k=1}^{p} \alpha_k \Psi^{-1}_{X_k(j)}(t) - \sum_{k=1}^{p} \beta_k \Psi^{-1}_{X_k(j)}(1 - t) + e_j(t)$$

with 

$$e_j(t) = a(j) + (2t - 1)\, b(j), \quad t \in [0, 1].$$

Each quantile function $\Psi^{-1}_{Y^*(j)}(t)$ given by the DSD Model is randomly disturbed by an error function $e_j(t)$ for different values of 
a(j) and b(j). The values of b(j) cannot be larger, in absolute value, than the respective value of the half range $r_{Y^*(j)}$, otherwise 
for this unit j the half range $r_{Y(j)}$ would be negative. 
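
The restriction on b(j) can be made explicit with a short derivation. It assumes, as in Equation (9), that under uniformity the quantile function of an interval with center c and half range r is $c + (2t - 1) r$, and it writes $Y^*(j)$ for the interval given by the model before the disturbance:

```latex
\begin{aligned}
\Psi^{-1}_{Y(j)}(t) &= \Psi^{-1}_{Y^*(j)}(t) + e_j(t) \\
                    &= c_{Y^*(j)} + (2t - 1)\, r_{Y^*(j)} + a(j) + (2t - 1)\, b(j) \\
                    &= \left( c_{Y^*(j)} + a(j) \right) + (2t - 1) \left( r_{Y^*(j)} + b(j) \right),
\end{aligned}
\qquad t \in [0, 1].
```

So a(j) only shifts the center of the interval, while b(j) changes its half range, which is non-negative only if $b(j) \geq -r_{Y^*(j)}$. When b(j) is drawn from a distribution symmetric around zero, this leads to the bound $|b(j)| \leq r_{Y^*(j)}$.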

To perform the simulation study, symbolic data tables that illustrate different situations were created 
according to a selected factorial design. For each situation considered, 1000 data tables were generated. The 
values in the tables of Appendixes B and C are the means of the 1000 values, together with the respective 
standard deviations s. 
3.2 Simulation study I 

In the first simulation study, the goal is to analyze the behavior of the error function and see whether it is possible 
to establish a relationship between the error function and the goodness-of-fit measures. To analyze the 
behavior of the error function, we consider intervals (the observations associated to the explicative and 
response variables) that have low variability, high variability or a mixture of intervals with variable half 
ranges. The following goodness-of-fit measures are considered in this study: 

• $\Omega$, where $\Omega$ is the measure deduced from the DSD Model (see Subsection 2.2); 

• Root-mean-square error ($RMSE_M$), a measure defined using the Mallows distance (also used in 
the DSD Model), proposed by Irpino and Verde [''7]; it is defined by 

$$RMSE_M = \sqrt{\frac{1}{m} \sum_{j=1}^{m} \int_0^1 \left( \Psi^{-1}_{Y(j)}(t) - \Psi^{-1}_{\hat{Y}(j)}(t) \right)^2 dt}$$
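
For interval-valued variables under uniformity, the integral in the definition of $RMSE_M$ can be evaluated in closed form, since the squared Mallows distance between two intervals is $(c - \hat{c})^2 + \frac{1}{3}(r - \hat{r})^2$. The sketch below relies on that closed form and assumes that the measure averages the squared distances over the m units; the function name is hypothetical.

```python
import numpy as np

def rmse_mallows(observed, predicted):
    """RMSE_M for intervals under uniformity (illustrative sketch).
    `observed` and `predicted` are arrays of shape (m, 2) with [lower, upper] rows."""
    obs = np.asarray(observed, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    c_obs, r_obs = obs.mean(axis=1), (obs[:, 1] - obs[:, 0]) / 2
    c_pred, r_pred = pred.mean(axis=1), (pred[:, 1] - pred[:, 0]) / 2
    d2 = (c_obs - c_pred) ** 2 + (r_obs - r_pred) ** 2 / 3   # squared Mallows distances
    return float(np.sqrt(d2.mean()))

print(rmse_mallows([[0, 2], [1, 5]], [[0, 4], [1, 5]]))
```

Because the measure is expressed in the units of the response variable, its values are only comparable between data sets of similar magnitude.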



Factorial design 

In this study a full factorial design was employed, with the following factors: 

• Sample size: m=10;100. 

• Number of explicative interval-valued variables p = 1. 

• Levels of variability in the explicative variable X (the distribution of the values in the microdata is 
Uniform). 

i) Low variability - $X(j) \sim U(\delta_1(j), \delta_2(j))$ are randomly generated considering, for each 
$j \in \{1, \ldots, m\}$, $\delta_1(j) \sim U(-2, 0)$ and $\delta_2(j) \sim U(4, 6)$; 

ii) High variability - $X(j) \sim U(\delta_3(j), \delta_4(j))$ are randomly generated considering, for each 
$j \in \{1, \ldots, m\}$, $\delta_3(j) \sim U(-14, -12)$ and $\delta_4(j) \sim U(16, 18)$; 

iii) Mixture with variable half ranges - $X(j) \sim U(\delta_5(j), \delta_6(j))$ are randomly generated 
considering, for each $j \in \{1, \ldots, m\}$, several options: 

- $\delta_5(j) \sim U(-2, 0)$ and $\delta_6(j) \sim U(0, 2)$; 

- $\delta_5(j) \sim U(-3, -1)$ and $\delta_6(j) \sim U(9, 11)$; 

- $\delta_5(j) \sim U(-11, -9)$ and $\delta_6(j) \sim U(29, 31)$; 

- $\delta_5(j) \sim U(-1, 1)$ and $\delta_6(j) \sim U(19, 21)$. 

• Parameters of the DSD Model. The selection of the parameters influences the levels of variability 
in the response variable Y. 
i) $\alpha = 2$; $\beta = 1$; $\gamma = -1$ (generate intervals with low (high) variability when the intervals of the 
explicative variable have low (high) variability); 

ii) $\alpha = 6$; $\beta = 0$; $\gamma = 2$ (generate intervals with moderate/high variability when the intervals of 
the explicative variable have low or high variability); 

iii) $\alpha = 2$; $\beta = 8$; $\gamma = 3$ (generate intervals with high variability when the intervals of the 
explicative variable have low or high variability). 

• The error function $e_j(t) = a(j) + (2t - 1)\, b(j)$, with $t \in [0, 1]$, is defined considering: 

i) Different levels of variability for the values a(j). The values of a(j) are randomly (uniformly) 
generated in $U(-s_a, s_a)$ with $s_a \in \{0, 2, 5, 10, 20, 40, 80, 120, 180\}$. 

ii) Different levels of variability for the values b(j). The values of b(j) are randomly (uniformly) 
generated in $U_{s_b} = U(-s_b, s_b)$ with $s_b \in \{0, 1, 2, 3, 4, 5, 6, 10, 20, 40, 80, 120\}$. As the 
value of b(j) cannot be larger than the respective minimum value of the half range $r_{Y^*(j)}$, in all 
situations when $mr = \min_{j \in \{1, \ldots, m\}} \{ r_{Y^*(j)} \}$ is lower than $s_b$ we consider $U_{s_b} = U(-mr, mr)$. 

The selection of the values $s_a$ and $s_b$ is done according to the size of the values in the intervals 
associated to the response variable Y. For this simulation study, to choose the highest value of $s_a$, 
we consider that a(j) must be outside the interval $[c_{Y^*(j)} - r_{Y^*(j)}, c_{Y^*(j)} + r_{Y^*(j)}]$. For the value 
$s_b$, the last chosen value is close to mr, since for higher values results are similar. 
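
The sampling of the error terms, including the cap on b(j), may be sketched as below. The function name and the seed are hypothetical; `r_star` stands for the half ranges $r_{Y^*(j)}$ of the intervals given by the model before the disturbance.

```python
import numpy as np

def sample_error_terms(r_star, s_a, s_b, seed=0):
    """Draw a(j) ~ U(-s_a, s_a) and b(j) ~ U(-s_b, s_b), replacing s_b by
    mr = min_j r_star[j] whenever mr < s_b (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    r_star = np.asarray(r_star, dtype=float)
    bound_b = min(s_b, r_star.min())          # cap so that no half range becomes negative
    a = rng.uniform(-s_a, s_a, size=r_star.size)
    b = rng.uniform(-bound_b, bound_b, size=r_star.size)
    return a, b

r_star = np.array([3.0, 5.0, 4.0])
a, b = sample_error_terms(r_star, s_a=10, s_b=120)
print(r_star + b)                             # disturbed half ranges, all non-negative
```

The cap is applied with the smallest half range of the whole table, so a single narrow interval limits the disturbance that can be applied to the half ranges of all units.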

Results and conclusions 

The tables with the results of the study may be found in Appendix B. Tables 7 to 9 present 
the means of the goodness-of-fit measure $\Omega$ and the means of the $RMSE_M$ when the interval-valued 
variable X presents low variability and the interval-valued variable Y was generated by the DSD Model 
considering the three selections for the parameters. The variability in Y is lower when the intervals of 
Y are generated by the model $\alpha = 2$; $\beta = 1$; $\gamma = -1$ (Table 7); moderate when the intervals of Y are 
generated by the model $\alpha = 6$; $\beta = 0$; $\gamma = 2$ (Table 8); and higher when the intervals of Y are generated 
by the model $\alpha = 2$; $\beta = 8$; $\gamma = 3$ (Table 9). To analyze the behavior of the error function and the 
impact of the values a and b on the disturbance of the linear relation between interval-valued variables, we 
considered several possible options for b; several values of a were associated with each b. Tables 10 
to 12 and Tables 13 to 15 present the results of similar studies applied to an interval-valued 
variable X that presents high variability or different variabilities, respectively. As concerns the influence of the values a 
and b that compose the error function, we may observe that when the variability 
in all intervals of the interval-valued variable X is low or high (Tables 7 to 12), the linearity between the 
data is more affected by the values of a than by the values of b. In all situations, for a fixed 
disturbance in the values of a, increasing the disturbance in the values of b affects the linear relation 
between the variables less than increasing the disturbance in the values of a for a fixed disturbance 
in the values of b. Figures 4 and 5 illustrate the situation in Table 7, where X has low 
variability and the parameters of the DSD Model are $\alpha = 2$; $\beta = 1$; and $\gamma = -1$. 
Figure 4: Mean values of $\Omega$ and the respective standard deviation for different error functions. (Horizontal-axis categories: $b = 0$, $b \in U(-2, 2)$, $b \in U(-5, 5)$, $b \in U(-10, 10)$ with $a \in U(-2, 2)$; and $a = 0$, $a \in U(-2, 2)$, $a \in U(-5, 5)$, $a \in U(-10, 10)$ with $b \in U(-2, 2)$.) 

Figure 5: Mean values of $RMSE_M$ and the respective standard deviation for different error functions. (Same categories as in Figure 4.) 
It is important to underline that the simulation study imposes an upper limit for the selected values of $b_j$ 
in the error function. This limitation prevents the analysis of the behavior of the models when the values 
of $b_j$ are high relative to the size of the half ranges $r_{Y^*(j)}$, or when we have a mixture of 
very different half ranges (Tables 13 to 15). When these situations occur, the disturbance of the perfect 
linear regression, as previously described, and that we use to generate the symbolic data tables, is affected 
almost only by the values of a. 

Analyzing in detail the obtained results enables concluding that, to obtain a model with low linearity, the 
value of $a_j$ should not be in the interval $[\underline{I}_{Y^*(j)}, \overline{I}_{Y^*(j)}]$. Considering this information, it is possible to 
suitably select the values of $a_j$ to disturb the linear relationship between interval-valued variables when 
the intervals in all observations have similar half ranges. Consequently, higher values of $a_j$ are necessary 
in the error function $e_j$ to disturb the linear relationships when the half range of the intervals of the 
response variable is large. However, when we have a mixture of half ranges in the explicative variables, 
this choice is more difficult. 

In this study we considered two measures to assess the goodness-of-fit: the coefficient of determination 
$\Omega$ and the root-mean-square error $RMSE_M$. According to the obtained results we may conclude that the 
$RMSE_M$ is not a relative measure. This measure takes into account the size of the values in the intervals, 
and therefore the magnitude of the values that compose the error function must take into account the size 
of the values that compose the intervals, when the goal is to disturb the perfect linear regression in a 
similar way. For interval-valued variables, even when the values of the intervals have very different sizes, 
we can have similar results of the measure $RMSE_M$ when we disturb the perfect linear relationship 
with similar error functions (similar values are selected for the values $a_j$ and $b_j$ to compose the error 
function $e_j$). However, the respective values of $\Omega$ may be very different. For example, from Tables 7 to 
10, when the error function considers $a \in U(-20, 20)$ and $b \in U(-10, 10)$, the mean value of $RMSE_M$ 
is always around 11 whereas the respective mean values of $\Omega$ are very different for the different situations. 
The measure $\Omega$ evaluates the quality of the linear relation independently of the magnitude of the values, 
whereas to interpret the values of the measure $RMSE_M$ we have to take into consideration the size of 
the values in the intervals. 



3.3 Simulation study II 

In the second simulation study, the goal is to analyze the behavior of the parameters' estimation and the 
performance of the DSD Model considering two levels of linearity between the interval-valued variables. 
For all situations, the observations of the interval-valued variables are generated from microdata with 
Uniform distribution. In addition to the goodness-of-fit measures $\Omega$ and $RMSE_M$ considered in 
Simulation Study I, we compute the lower and the upper bound root-mean-square errors ($RMSE_L$ and $RMSE_U$, 
respectively) that Lima Neto and De Carvalho [12, 11] use to study the performance of their linear 
regression models. These measures are defined as follows: 

$$RMSE_L = \sqrt{\frac{1}{m} \sum_{j=1}^{m} \left( \underline{I}_{Y(j)} - \underline{I}_{\hat{Y}(j)} \right)^2}, \qquad RMSE_U = \sqrt{\frac{1}{m} \sum_{j=1}^{m} \left( \overline{I}_{Y(j)} - \overline{I}_{\hat{Y}(j)} \right)^2}$$

with $[\underline{I}_{Y(j)}, \overline{I}_{Y(j)}]$ and $[\underline{I}_{\hat{Y}(j)}, \overline{I}_{\hat{Y}(j)}]$ 
the observed and predicted intervals, for each unit j. 
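
A minimal sketch of these two measures follows; it assumes that each measure is the root of the mean squared difference between the observed and predicted lower (respectively upper) bounds. The function name is hypothetical.

```python
import numpy as np

def rmse_bounds(observed, predicted):
    """Lower- and upper-bound root-mean-square errors (illustrative sketch).
    `observed` and `predicted` are arrays of shape (m, 2) with [lower, upper] rows."""
    obs = np.asarray(observed, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    rmse_l = float(np.sqrt(np.mean((obs[:, 0] - pred[:, 0]) ** 2)))
    rmse_u = float(np.sqrt(np.mean((obs[:, 1] - pred[:, 1]) ** 2)))
    return rmse_l, rmse_u

print(rmse_bounds([[0, 4], [1, 5]], [[1, 4], [1, 7]]))
```

Unlike $RMSE_M$, these measures compare the bounds separately and do not use any assumption about the distribution of the values within the intervals.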



Factorial design 

In this study a full factorial design was employed, with the following factors: 

• Number of explicative interval-valued variables: p = 1 and p = 3. 

• Parameters of the DSD Model. 

o For p = 1: 

i) $\alpha = 2$; $\beta = 1$; $\gamma = -1$ ($\alpha$ and $\beta$ are close) 

ii) $\alpha = 6$; $\beta = 0$; $\gamma = 2$ ($\alpha$ is higher than $\beta$) 

iii) $\alpha = 2$; $\beta = 8$; $\gamma = 3$ ($\alpha$ is lower than $\beta$) 

o For p = 3: 

i) $\alpha_1 = 2$; $\beta_1 = 1$; $\alpha_2 = 0.5$; $\beta_2 = 3$; $\alpha_3 = 1.5$; $\beta_3 = 1$; $\gamma = -1$ (the values of $\alpha$ and $\beta$ are 
close) 

ii) $\alpha_1 = 6$; $\beta_1 = 0$; $\alpha_2 = 2$; $\beta_2 = 8$; $\alpha_3 = 10$; $\beta_3 = 5$; $\gamma = 3$ (the values of $\alpha$ and $\beta$ are 
apart) 

• Levels of variability in the explicative variables $X_k$. 

i) Low variability - $X_k(j) \sim U(\delta_1(j), \delta_2(j))$ are randomly generated considering, for each $j \in \{1, \ldots, m\}$ 
and $k \in \{1, 2, 3\}$: 

- k = 1 - $\delta_1(j) \sim U(-2, 0)$ and $\delta_2(j) \sim U(4, 6)$; 

- k = 2 - $\delta_1(j) \sim U(1, 3)$ and $\delta_2(j) \sim U(3, 5)$; 

- k = 3 - $\delta_1(j) \sim U(4, 6)$ and $\delta_2(j) \sim U(9, 11)$; 

ii) High variability - $X_k(j) \sim U(\delta_3(j), \delta_4(j))$ are randomly generated considering, for each $j \in \{1, \ldots, m\}$ 
and $k \in \{1, 2, 3\}$: 

- k = 1 - $\delta_3(j) \sim U(-14, -12)$ and $\delta_4(j) \sim U(16, 18)$; 

- k = 2 - $\delta_3(j) \sim U(1, 3)$ and $\delta_4(j) \sim U(25, 27)$; 

- k = 3 - $\delta_3(j) \sim U(-16, -14)$ and $\delta_4(j) \sim U(-1, 1)$; 

iii) Variable half ranges - $X_k(j) \sim U(\delta_5(j), \delta_6(j))$ are randomly generated considering, for each 
$j \in \{1, \ldots, m\}$ and $k \in \{1, 2, 3\}$, the several options: 

- $\delta_5(j) \sim U(-2, 0)$ and $\delta_6(j) \sim U(0, 2)$; 

- $\delta_5(j) \sim U(-3, -1)$ and $\delta_6(j) \sim U(9, 11)$; 

- $\delta_5(j) \sim U(-11, -9)$ and $\delta_6(j) \sim U(29, 31)$; 

- $\delta_5(j) \sim U(-1, 1)$ and $\delta_6(j) \sim U(19, 21)$. 

• Two levels of linearity are considered. For each $j \in \{1, \ldots, m\}$, the values of a(j) and b(j) are 
randomly generated as follows: 

i) Low linearity - a{j) - ^^(- "'+"" , "'+'"" ) and h{j) - U{~mr, mr). 

ii) High linearity - a (j) ^ ^^(-| '"'+"" , | "''+"^" ) and 6(j) - Z^(-imr ^mr). 



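where $mI = \min_{j \in \{1, \ldots, m\}} \{ \underline{I}_{Y^*(j)} \}$, $MI = \max_{j \in \{1, \ldots, m\}} \{ \overline{I}_{Y^*(j)} \}$ and $mr = \min_{j \in \{1, \ldots, m\}} \{ r_{Y^*(j)} \}$. 

• Sample size: m = 10; 30; 100; 250. 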

Results and conclusions 

The tables with the results of the study can be found in Appendix C. Tables 16 to 18 present the results 
obtained for the estimated parameters and the goodness-of-fit measures, with p = 1, for the three selected 
values of $\alpha$, $\beta$ and $\gamma$. In Tables 19 to 22 similar results are presented for the considered cases where p = 3. 
Based on the obtained results, presented in Appendix C, we can see that the behavior of the parameters' 
estimation is independent of the number of explicative variables in the model and of the parameters selected 
for the model. In each of these situations, three levels of variability in the explicative variables $X_k$ were 
considered, each of them with two levels of linearity, and two types of behavior were observed. 

o When the linearity between the variables is high and the variability of the half ranges of the intervals 
of $X_k$ is low or we have variable half ranges, the estimated parameters are close to the initial 
parameter values. However, for high levels of linearity, when the variability of the half ranges of 
the intervals is high and mainly when the sample size is small, the estimated parameters are more 
distant from the initial parameters. This difference is larger in the independent parameter. 

o When the level of linearity between the variables is low, many of the estimated parameters are distant 
from the original ones. These cases, observed mainly when the number of observations is low, are 
not surprising because other models may exist that fit the interval data better. 

According to this, the analysis of the behavior of the MSE and the mean of the estimated parameters 
is essentially applicable in situations where the level of linearity is high. For almost all these cases, we 
observe that the values of the MSE decrease and tend to zero as the number of observations increases 
and the mean of the estimated parameters becomes very close to the respective parameters of the model. 
For situations where the half ranges of the intervals of Y are larger (which occurs when the variability of 
X is high or when the values of the parameters are far apart), the independent parameter has a high value 
of MSE and a high standard deviation associated to the mean value. As such, intervals with large half 
ranges in the response variable cause more instability in the DSD Model and therefore the parameters' 
estimation is more unstable, essentially on the independent parameter. In the boxplots presented in 
Figures 6 to 8 we may observe the behavior described above for the situations where p = 1 and the original 
values of the parameters are $\alpha = 2$; $\beta = 1$; $\gamma = -1$. 



Figure 6: Boxplot for the estimated parameter $\alpha$ for a high level of linearity and when the original 
parameters are $\alpha = 2$; $\beta = 1$; $\gamma = -1$. 

Figure 7: Boxplot for the estimated parameter $\beta$ for a high level of linearity and when the original 
parameters are $\alpha = 2$; $\beta = 1$; $\gamma = -1$. 

Figure 8: Boxplot for the estimated parameter $\gamma$ for a high level of linearity and when the original 
parameters are $\alpha = 2$; $\beta = 1$; $\gamma = -1$. 

Based on this simulation study, we can also assess the behavior of the coefficient of determination 
associated to the DSD Model and the values of the root-mean-square errors. The values obtained for $\Omega$ show 
that this measure provides a good evaluation of the level of linearity. The slightly disturbed models present 
values of $\Omega$ close to one. On the other hand, when the error function applied to the model causes a high 
disturbance in the linear relation, the values of $\Omega$ are closer to zero. Furthermore, the mean values of $\Omega$ are 
consistent with the respective values of the measures $RMSE_M$, $RMSE_L$ and $RMSE_U$. In general, as 
expected, in each situation and for the respective level of variability of the explicative variables, the highest 
values of $\Omega$ correspond to the lowest values of $RMSE_M$. The values that compose the error function $e_j$ 
are obtained considering the same criterion in all situations, but when the explicative variables include 
a mixture of intervals with different half ranges, the values that we obtain for $\Omega$ are lower than the ones 
obtained in other situations. This happens because, as we have a variety of intervals, the error functions 
do not affect all intervals in the same way. 



4 Applied examples 

4.1 The relation between time of unemployment and years of employment 

The 2008 Portuguese Labour Force Survey provides individual information about the people that live in 
Portugal. The original data table that we analyzed contains, among others, demographic variables (such 
as gender, marital status, age, level of education, employer, ...) and geographical location (region, city, ...). 
In this study we are interested in analyzing whether the time of unemployment (in months) is related to the 
time (in years) that people have worked previously. However, we are not interested in performing this 
study for each individual, as it may be of greater interest to determine what happens in certain categories, 
such as young women who live in the North of Portugal. Since each of these categories consists of several 
individuals, the observed value is no longer a single point but an interval. So, in this case, the symbolic 
data table is built considering that the units (higher units) are classes of individuals obtained by crossing 
gender × region × age × education. Here, there are two genders (female (F), male (M)); four regions 
(North (N), Center (C), Lisbon and Tagus Valley (L), South (S)); three age groups (15 to 24 (A1), 25 to 
44 (A2), 45 to 64 (A3)) and three levels of education (basic education (B), secondary education (S) and 
graduate (G)). In total we have 2 × 4 × 3 × 3 = 72 possible classes (categories). The time of unemployment 
and the time of work before unemployment are now interval-valued symbolic variables. 

Table 2 represents the symbolic table that results from the original data table, for the variables X (time of 
employment before unemployment) and Y (time of unemployment). 



Units      Y          X          Units      Y          X          Units      Y          X
F×C×A1×B   [3; 49]    [0; 4]     F×N×A3×S   [0; 123]   [23; 35]   M×L×A3×B   [1; 244]   [22; 57]
F×C×A1×S   [1; 6]     [0; 2]     F×S×A1×B   [1; 52]    [1; 7]     M×L×A3×S   [2; 65]    [25; 50]
F×C×A2×B   [2; 147]   [2; 34]    F×S×A1×S   [1; 36]    [0; 9]     M×L×A3×G   [7; 44]    [28; 40]
F×C×A2×S   [3; 61]    [5; 22]    F×S×A1×G   [1; 13]    [0; 1]     M×N×A1×B   [1; 33]    [0; 18]
F×C×A2×G   [4; 16]    [0; 15]    F×S×A2×B   [1; 101]   [0; 33]    M×N×A1×S   [1; 15]    [1; 4]
F×C×A3×S   [1; 108]   [23; 47]   F×S×A2×S   [0; 96]    [0; 25]    M×N×A2×B   [1; 97]    [1; 35]
F×L×A1×B   [1; 18]    [1; 7]     F×S×A2×G   [1; 21]    [1; 27]    M×N×A2×S   [1; 46]    [0; 21]
F×L×A1×S   [1; 19]    [1; 11]    F×S×A3×B   [1; 265]   [8; 52]    M×N×A2×G   [2; 100]   [2; 14]
F×L×A2×B   [0; 156]   [3; 34]    F×S×A3×S   [3; 26]    [20; 37]   M×N×A3×B   [0; 159]   [15; 52]
F×L×A2×S   [2; 69]    [3; 25]    M×C×A1×B   [3; 6]     [0; 8]     M×N×A3×S   [9; 35]    [20; 40]
F×L×A2×G   [0; 63]    [0; 22]    M×C×A1×S   [2; 3]     [0; 4]     M×N×A3×G   [9; 19]    [31; 36]
F×L×A3×B   [1; 320]   [29; 58]   M×C×A2×B   [2; 97]    [10; 28]   M×S×A1×B   [1; 35]    [0; 10]
F×L×A3×S   [2; 162]   [22; 36]   M×C×A2×G   [7; 13]    [4; 10]    M×S×A1×S   [4; 63]    [1; 6]
F×L×A3×G   [8; 27]    [12; 32]   M×C×A3×B   [4; 98]    [30; 51]   M×S×A2×B   [0; 157]   [4; 35]
F×N×A1×B   [1; 61]    [0; 9]     M×C×A3×S   [20; 38]   [25; 39]   M×S×A2×S   [1; 21]    [7; 24]
F×N×A1×S   [0; 10]    [0; 3]     M×L×A1×B   [2; 20]    [0; 9]     M×S×A2×G   [4; 18]    [5; 20]
F×N×A2×B   [1; 325]   [6; 32]    M×L×A1×S   [4; 14]    [1; 9]     M×S×A3×B   [1; 274]   [26; 56]
F×N×A2×S   [2; 88]    [2; 25]    M×L×A2×B   [1; 194]   [0; 31]    M×S×A3×S   [11; 26]   [28; 42]
F×N×A2×G   [2; 80]    [1; 25]    M×L×A2×S   [4; 133]   [3; 23]
F×N×A3×B   [1; 372]   [11; 57]   M×L×A2×G   [6; 65]    [4; 16]

Table 2: Symbolic data table where the two variables, time of activity before unemployment and time of 
unemployment, are interval-valued variables. 

The main goal of this study is to analyze the linear relationship between the interval-valued variables: 
logarithm of the time of unemployment, LNY ($LNY = \ln(Y + 2)$), and time of activity before 
the unemployment X, considering as observed units (higher units) the classes of individuals previously 
described. 

We predicted the quantile function representing the interval taken by the interval-valued variable LNY 
from the DSD Model, and obtained: 

$$\Psi^{-1}_{\widehat{LNY}(j)}(t) = 2.2277 + 0.0779\, \Psi^{-1}_{X(j)}(t) - 0.0503\, \Psi^{-1}_{X(j)}(1 - t)$$

In this case, the predicted interval for each unit j is given by 

$$\left[ 0.0276\, c_{X(j)} - 0.1282\, r_{X(j)} + 2.2277, \; 0.0276\, c_{X(j)} + 0.1282\, r_{X(j)} + 2.2277 \right]$$
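
As a worked illustration of this expression, consider a unit whose observed interval is $X(j) = [0; 4]$ (a value that occurs in Table 2), so that $c_{X(j)} = 2$ and $r_{X(j)} = 2$. Evaluating the fitted model gives:

```latex
\begin{aligned}
\underline{I}_{\widehat{LNY}(j)} &= 0.0276 \times 2 - 0.1282 \times 2 + 2.2277 = 2.0265, \\
\overline{I}_{\widehat{LNY}(j)}  &= 0.0276 \times 2 + 0.1282 \times 2 + 2.2277 = 2.5393.
\end{aligned}
```

The same bounds follow from the quantile function at $t = 0$ and $t = 1$: $2.2277 + 0.0779 \times 0 - 0.0503 \times 4 = 2.0265$ and $2.2277 + 0.0779 \times 4 - 0.0503 \times 0 = 2.5393$. The predicted half range, $0.2564$, is $0.1282$ times the observed half range of X, as expected from $\alpha + \beta = 0.0779 + 0.0503$.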
As interpreted in Subsection 2.2, the interval-valued variables X and LNY have a linear relation that 
tends to be direct, because the value estimated for the parameter $\alpha = 0.0779$ is slightly greater than 
$\beta = 0.0503$. For the set of classes of individuals to which the data refer, when the symbolic mean of the 
time of activity before unemployment increases by one year, the symbolic mean of LNY (in months) 
increases by 0.0276. However, the relationship described by the DSD Model is not very strong. The value 
of the goodness-of-fit measure $\Omega$ deduced from the model is 0.7715 for these data. The scatter plot of these 
data can be observed in Figure 9(a). However, as we have a large number of units, the scatter plot that 
represents the observed intervals of both variables by rectangles is very hard to interpret, so we chose to 
represent only the diagonals of the rectangles. 




(a) Observed intervals for LNY.    (b) Predicted intervals for LNY. 

Figure 9: Scatter plot considering the observed intervals for the interval-valued variables X and LNY or 
the predicted intervals. 

As we have said in Subsection 2.2, a perfect linear regression by the DSD Model between two 
interval-valued variables induces a perfect linear regression between the centers of the intervals and also implies 
that the ratio of the ranges of the intervals is constant and equal for all observations. These behaviors can 
be illustrated by the scatter plot in Figure 9(b), which considers the intervals observed for the variable X and 
the intervals predicted by the DSD Model for the variable LNY. 

The purpose of this example is not only to illustrate the DSD Model, but also to compare the results with 
other models already proposed [12, 11, 5, 10, 9]. In Table 3 we present the models and the Root Mean 
Square Errors generally used as measures of goodness of fit. 

In this example the CRM and CCRM are the same: because in the CRM the parameters estimated for the 
half ranges are all non-negative, the constraints imposed in the CCRM on these parameters are met. We can 
also observe that the linear regression induced by the DSD Model relative to the centers of the intervals 
is the one obtained by the models where a linear regression between the centers is considered. The results of 
the Root Mean Square Error (RMSE) allow comparing the predicted and the observed intervals of the 
response variable LNY. These measures are not deduced from the model, therefore they may serve as 
independent comparison measures. Observing the values of the RMSE, we can conclude that the DSD 
Model and CRM (and CCRM) have similar results, which is not surprising because the linear regression 
between the centers is the same. It is important to underline that the goal of the work developed in this 
paper is not to propose a model that provides better results than the previous models. The DSD Model for 
interval-valued variables emerges from the particularization of a more general model, the DSD Linear 
Regression Model for histogram-valued variables. The advantage of the DSD Model when applied to 
interval-valued variables is that it allows taking into consideration a distribution within the intervals. 

Models         Expressions that allow predicting the intervals of LNY for each j                                                        RMSE_L   RMSE_U   RMSE_M
DSD            $\Psi^{-1}_{\widehat{LNY}(j)}(t) = 2.2277 + 0.0779\, \Psi^{-1}_{X(j)}(t) - 0.0503\, \Psi^{-1}_{X(j)}(1 - t)$            0.5745   0.6710   0.4679
               $c_{\widehat{LNY}(j)} = 2.2277 + 0.0276\, c_{X(j)}$ and $r_{\widehat{LNY}(j)} = 0.1282\, r_{X(j)}$
CM             $c_{\widehat{LNY}(j)} = 2.2277 + 0.0276\, c_{X(j)}$                                                                       1.1622   1.3146   0.7759
Billard 2007   $c_{\widehat{LNY}(j)} = 1.9009 + 0.0468\, c_{X(j)}$                                                                       1.1504   1.0365   0.7255
MinMax         $\underline{I}_{\widehat{LNY}(j)} = 1.2236 + 0.0206\, \underline{I}_{X(j)}$                                               0.4725   0.7329   0.4621
               $\overline{I}_{\widehat{LNY}(j)} = 2.8704 + 0.0436\, \overline{I}_{X(j)}$
CRM            $c_{\widehat{LNY}(j)} = 2.2277 + 0.0276\, c_{X(j)}$                                                                       0.4458   0.6541   0.4397
               $r_{\widehat{LNY}(j)} = 1.0642 + 0.0855\, r_{X(j)}$
CCRM           $c_{\widehat{LNY}(j)} = 2.2277 + 0.0276\, c_{X(j)}$                                                                       0.4458   0.6541   0.4397
               $r_{\widehat{LNY}(j)} = 1.0642 + 0.0855\, r_{X(j)}$

Table 3: Comparison of the performance of linear regression models for interval-valued variables. 

4.2 Predicted burned area of forest fires, in the northeast region of Portugal 

This study considers forest fire data from the Montesinho natural park, in the northeast region of Portugal. 
The original data can be found in [ ] and details are described in [ ]. For this study we selected 
the response variable area (the burned area of the forest, in ha) and three explicative variables: temp 
(temperature in degrees Celsius); wind (wind speed in km/h); rh (relative humidity in percentage). As 
in the classical study [ ], the response variable area was transformed with a $\ln(x + 1)$ function and 
we represent it as LNarea. To build the symbolic data (macrodata) we aggregated the information by 
months. The units (higher units) of this study are the months, and the observations of the variables temp, 
wind, rh and LNarea associated to each month were organized in intervals. To build these macrodata we 
considered only the months and the records in which forest fires occurred. For this reason January and 
November were eliminated. The symbolic data considered in this example are represented in Table 4. 

Considering the conditions described above, the model that allows predicting the intervals of LNarea
from the intervals of the explicative variables temp, wind and rh for each month j is as follows:

\[
\Psi^{-1}_{\widehat{LNarea}(j)}(t) = 1.8637 + 0.0224\,\Psi^{-1}_{temp(j)}(t) - 0.0215\,\Psi^{-1}_{temp(j)}(1-t) - 0.0143\,\Psi^{-1}_{rh(j)}(1-t) \tag{12}
\]

with $t \in [0,1]$.
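
As an illustration of how an expression such as (12) turns intervals of the explicative variables into an interval for the response, the following minimal sketch evaluates the quantile function at t = 0 and t = 1, assuming the Uniform distribution within each interval. The function names and the forecast intervals are hypothetical; only the coefficients are taken from (12).

```python
def uniform_quantile(interval, t):
    """Quantile function of a Uniform distribution on [low, high], t in [0, 1]."""
    low, high = interval
    center = (low + high) / 2.0
    half_range = (high - low) / 2.0
    return center + (2.0 * t - 1.0) * half_range


def dsd_quantile(t, gamma, alphas, betas, intervals):
    """gamma + sum_k alpha_k * Psi_k^{-1}(t) - sum_k beta_k * Psi_k^{-1}(1 - t)."""
    value = gamma
    for alpha, beta, interval in zip(alphas, betas, intervals):
        value += alpha * uniform_quantile(interval, t)
        value -= beta * uniform_quantile(interval, 1.0 - t)
    return value


def dsd_interval(gamma, alphas, betas, intervals):
    """Predicted interval: the quantile function at t = 0 and at t = 1."""
    lower = dsd_quantile(0.0, gamma, alphas, betas, intervals)
    upper = dsd_quantile(1.0, gamma, alphas, betas, intervals)
    return lower, upper


# Coefficients read from equation (12), in the order temp, wind, rh.
GAMMA = 1.8637
ALPHAS = (0.0224, 0.0, 0.0)
BETAS = (0.0215, 0.0, 0.0143)

# Hypothetical forecast intervals for one month (illustrative, not from the data).
forecast = [(10.0, 20.0), (2.0, 6.0), (30.0, 60.0)]  # temp, wind, rh

lower, upper = dsd_interval(GAMMA, ALPHAS, BETAS, forecast)
print(f"predicted LNarea interval: [{lower:.4f}, {upper:.4f}]")
```

Because all coefficients multiplying the quantile functions are non-negative, the resulting function is non-decreasing in t, so the value at t = 0 is never above the value at t = 1.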

The goodness-of-fit measure for this model is $\Omega = 0.9202$, which shows that this linear regression
model describes well the relationship between the interval-valued variables. So, if we know
the forecast for the temperature, wind and relative humidity for one month, it is possible to predict the
minimum and maximum of the burned area of the forest.

Months   LNarea (Y)      temp            wind          rh

Feb      [0.74; 3.97]    [4.6; 12.4]     [0.9; 9.4]    [35; 82]
Mar      [0.67; 3.63]    [5.3; 17]       [0.9; 9.4]    [26; 70]
Apr      [1.47; 4.13]    [5.8; 13.7]     [3.1; 9.4]    [33; 64]
May      [3.58; 3.58]    [18; 18]        [4; 4]        [40; 40]
June     [0.64; 4.27]    [14.3; 28]      [1.8; 9.4]    [34; 79]
July     [0.31; 5.63]    [11.2; 33.3]    [0.4; 8.9]    [22; 88]
Aug      [0.09; 6.62]    [11.2; 33.3]    [0.4; 8.9]    [22; 88]
Sep      [0.29; 7]       [10.1; 29.6]    [0.9; 7.6]    [15; 78]
Oct      [1.9; 3.9]      [16.1; 20.2]    [2.7; 4.5]    [25; 45]
Dec      [1.9; 3.2]      [2.2; 5.1]      [4.9; 8.5]    [21; 61]

Table 4: Burned area data, where the four variables LNarea, temp, wind and rh are now interval-valued
variables.

As the model behaves well in this situation, in the next study we compare the
observed and predicted intervals of the variable LNarea for each month j, j ∈ {February,
March, April, May, June, July, August, September, October, December}. We also consider the
predictions obtained when month j is not used in building the model that predicts its interval of
LNarea. The results are represented in Figure 10. The months are represented on the x-axis
and the intervals, observed and predicted in the two ways described above, are represented on the y-axis.
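
The scheme in which the month being predicted is left out of the fit can be sketched as follows. This is only an outline of the validation loop: `fit_mean_interval` and `predict_mean_interval` are hypothetical stand-ins (they simply return the mean observed interval of the training months) and are NOT the DSD Model; any fitting routine with the same interface could be plugged in. The three months shown use the values of Table 4.

```python
def leave_one_out(units, fit, predict):
    """For each unit, fit on all the other units and predict the held-out one."""
    predictions = {}
    for held_out, observation in units.items():
        training = {name: obs for name, obs in units.items() if name != held_out}
        model = fit(training)
        predictions[held_out] = predict(model, observation)
    return predictions


def fit_mean_interval(training):
    """Stand-in model (assumption): the mean observed LNarea interval."""
    lowers = [y[0] for _, y in training.values()]
    uppers = [y[1] for _, y in training.values()]
    return (sum(lowers) / len(lowers), sum(uppers) / len(uppers))


def predict_mean_interval(model, observation):
    """The stand-in ignores the explicative intervals of the held-out unit."""
    return model


# Each unit maps to (explicative intervals, LNarea interval); values from Table 4.
months = {
    "Feb": ({"temp": (4.6, 12.4), "wind": (0.9, 9.4), "rh": (35, 82)}, (0.74, 3.97)),
    "Mar": ({"temp": (5.3, 17.0), "wind": (0.9, 9.4), "rh": (26, 70)}, (0.67, 3.63)),
    "Apr": ({"temp": (5.8, 13.7), "wind": (3.1, 9.4), "rh": (33, 64)}, (1.47, 4.13)),
}

predictions = leave_one_out(months, fit_mean_interval, predict_mean_interval)
for month, (lower, upper) in predictions.items():
    print(f"{month}: [{lower:.2f}, {upper:.2f}]")
```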



Figure 10: Observed and predicted intervals for the interval-valued variable LNarea, for each month. 

Observing Figure 10, we can say that the prediction of the intervals is in general quite good, although slightly
worse for May and December. Comparing the predictions obtained by the DSD Model in (12) with those
obtained by the models in which the month being predicted is not used in the estimation of the
parameters, we observe that the differences are generally small.

The expressions of the models proposed by Lima Neto and De Carvalho [12] and Billard and Diday [9]
that allow predicting the intervals of values of the burned area of forest fires are given in Table 5,
together with the expression of the DSD Model.

Models   Expressions that allow predicting the intervals of LNarea for each j

DSD      $\Psi^{-1}_{\widehat{LNarea}(j)}(t) = 1.8637 + 0.0224\,\Psi^{-1}_{temp(j)}(t) - 0.0215\,\Psi^{-1}_{temp(j)}(1-t) - 0.0143\,\Psi^{-1}_{rh(j)}(1-t)$
         $c_{\widehat{LNarea}(j)} = 1.8637 + 0.0009\,c_{temp(j)} - 0.0143\,c_{rh(j)}$
         $r_{\widehat{LNarea}(j)} = 0.0439\,r_{temp(j)} + 0.0143\,r_{rh(j)}$

CM       $c_{\widehat{LNarea}(j)} = 1.9163 + 0.0015\,c_{temp(j)} + 0.0027\,c_{wind(j)} - 0.0158\,c_{rh(j)}$

MinMax   $\underline{I}_{\widehat{LNarea}(j)} = 1.1559 + 0.0123\,\underline{I}_{temp(j)} - 0.0379\,\underline{I}_{wind(j)} + 0.0085\,\underline{I}_{rh(j)}$
         $\overline{I}_{\widehat{LNarea}(j)} = -0.3930 + 0.0112\,\overline{I}_{temp(j)} + 0.2372\,\overline{I}_{wind(j)} + 0.0168\,\overline{I}_{rh(j)}$

CRM      $c_{\widehat{LNarea}(j)} = 1.9163 + 0.0015\,c_{temp(j)} + 0.0027\,c_{wind(j)} - 0.0159\,c_{rh(j)}$
         $r_{\widehat{LNarea}(j)} = 0.0091 + 0.0652\,r_{temp(j)} - 0.0072\,r_{wind(j)} + 0.0089\,r_{rh(j)}$

CCRM     $c_{\widehat{LNarea}(j)} = 1.9163 + 0.0015\,c_{temp(j)} + 0.0027\,c_{wind(j)} - 0.0159\,c_{rh(j)}$
         $r_{\widehat{LNarea}(j)} = 0.0037 + 0.0651\,r_{temp(j)} + 0.0081\,r_{rh(j)}$

Table 5: Comparison of the performance between linear regression models for interval-valued variables.

In this case, as one of the estimated parameters of the model associated with the half ranges in CRM is
negative, the expression for the half ranges in CCRM is already different.



Models    RMSE_L    RMSE_U    RMSE_M

DSD       0.1106    0.1222    0.1066
CM        0.3076    0.2676    0.1856
MinMax    0.1481    0.0940    0.1044
CRM       0.1030    0.1161    0.1038
CCRM      0.1034    0.1159    0.1038


Table 6: Comparison of the performance of different linear regression models for the burned-area interval 
data. 

Table 6 also presents the RMSE for the models previously proposed [12, 9] and for the DSD Model.
As we observed in Example 4.1, the RMSE values calculated for the CRM, CCRM and DSD Model
are again very similar.
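
For readers who want to reproduce measures of this kind, the sketch below computes three root-mean-square errors from lists of observed and predicted intervals. It assumes that RMSE_L and RMSE_U are computed on the lower and upper bounds, and that RMSE_M is based on the squared Mallows distance between Uniform distributions, (c - c')^2 + (r - r')^2 / 3, which is the form that appears in the objective function of the appendix. The function names and the example intervals are hypothetical.

```python
import math


def center_and_half_range(interval):
    low, high = interval
    return (low + high) / 2.0, (high - low) / 2.0


def rmse_lower_upper(observed, predicted):
    """Root-mean-square errors of the lower and of the upper bounds."""
    m = len(observed)
    sq_lower = sum((o[0] - p[0]) ** 2 for o, p in zip(observed, predicted))
    sq_upper = sum((o[1] - p[1]) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(sq_lower / m), math.sqrt(sq_upper / m)


def mallows_sq_uniform(a, b):
    """Squared Mallows distance between Uniform distributions on a and b."""
    c_a, r_a = center_and_half_range(a)
    c_b, r_b = center_and_half_range(b)
    return (c_a - c_b) ** 2 + (r_a - r_b) ** 2 / 3.0


def rmse_mallows(observed, predicted):
    """Root of the mean squared Mallows distance (assumed form of RMSE_M)."""
    m = len(observed)
    total = sum(mallows_sq_uniform(o, p) for o, p in zip(observed, predicted))
    return math.sqrt(total / m)


# Hypothetical observed and predicted intervals, for illustration only.
observed = [(0.0, 2.0), (1.0, 3.0)]
predicted = [(0.0, 2.0), (2.0, 4.0)]

rmse_l, rmse_u = rmse_lower_upper(observed, predicted)
rmse_m = rmse_mallows(observed, predicted)
print(rmse_l, rmse_u, rmse_m)
```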

In Figure 11 we may compare the predicted intervals of the values of burned area of forest fires in all 
months considering the linear regression models CM, MinMax, CRM, CCRM and DSD. 



[Figure: one panel per month, starting with February; legend: LNarea observed and LNarea predicted by each of the models.]


Figure 11: Observed and predicted intervals for the LNarea in all months, predicted with several models. 



5 Conclusion 

An interval-valued variable is a particular case of a histogram-valued variable in which every observation
consists of a single interval with weight equal to one. A classical variable is a particular case of an
interval-valued variable in which every observation is a degenerate interval (an interval whose lower
and upper bounds coincide). Because of this link between histogram-valued, interval-valued and classical
variables, it was logical that the DSD Model for histogram-valued variables could be particularized to
interval-valued variables, which in turn could be particularized to real values, as we have observed in this paper.

The main advantage of the DSD Model is that it defines a linear relationship between one response
variable and n explicative variables without decomposing the intervals into their bounds or into their centers
and half ranges. In fact, as this model uses the quantile function to represent the intervals, it allows working
with the intervals themselves and considering the distributions within them. In this paper we assume the Uniform
distribution in all intervals that are the "values" observed for the interval-valued variables. Under these
conditions, the DSD Model induces a relation between the half ranges and a relation between the centers
of the intervals in which the respective estimated parameters are not independent; in the case of the half
ranges, this relation is always direct, similarly to what occurs in the Constrained Center and Range
Method.
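
The two induced relations can be obtained in a few lines. The derivation below is a sketch for a single explicative variable X, assuming the form of the model used in (12) with non-negative parameters $\alpha$ and $\beta$, and writing c and r for the center and the half range of an interval.

```latex
\begin{align*}
% Quantile function of a Uniform distribution on [c - r, c + r], t in [0, 1]
\Psi^{-1}_{X(j)}(t) &= c_{X(j)} + (2t - 1)\, r_{X(j)} \\
% Model with one explicative variable
\Psi^{-1}_{\widehat{Y}(j)}(t)
  &= \gamma + \alpha\,\Psi^{-1}_{X(j)}(t) - \beta\,\Psi^{-1}_{X(j)}(1 - t) \\
  &= \gamma + (\alpha - \beta)\, c_{X(j)} + (2t - 1)(\alpha + \beta)\, r_{X(j)} \\
% Matching the constant term and the (2t - 1) term
c_{\widehat{Y}(j)} &= \gamma + (\alpha - \beta)\, c_{X(j)}, \qquad
r_{\widehat{Y}(j)} = (\alpha + \beta)\, r_{X(j)}
\end{align*}
```

Since $\alpha + \beta \geq 0$, the predicted half range never decreases when $r_{X(j)}$ increases, which is the direct relation mentioned above; and because the same pair $(\alpha, \beta)$ appears in both expressions, the parameters of the two relations cannot be chosen independently.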

The DSD Model has the potential to take into consideration the distribution within the intervals associated
with the observations of the interval-valued variables. As such, it is possible to adapt the proposed model
to interval-valued variables with other distributions. For example, the DSD Model may be developed
considering a triangular distribution in the intervals. Since most studies in Symbolic Data Analysis
assume that the values in the intervals are uniformly distributed, all descriptive statistics would then
also have to be redefined.
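
To give an idea of what such an adaptation would involve, the quantile function that would replace the Uniform one is written below for one possible choice, a symmetric triangular distribution with mode at the center of the interval. This choice is an assumption made only for illustration; the text does not specify which triangular distribution would be used.

```latex
% Symmetric triangular distribution on [c - r, c + r] with mode c (assumed)
\Psi^{-1}_{X(j)}(t) =
\begin{cases}
  c_{X(j)} - r_{X(j)} + r_{X(j)}\sqrt{2t},       & 0 \le t \le \tfrac{1}{2}, \\[4pt]
  c_{X(j)} + r_{X(j)} - r_{X(j)}\sqrt{2(1 - t)}, & \tfrac{1}{2} < t \le 1.
\end{cases}
```

Unlike the Uniform case, this function is not linear in t, so the distance between two such quantile functions, and every statistic built on it, would take a different form.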

Furthermore, a generalization of the DSD Model is currently under development with the aim of obtaining 
a more flexible model. In this new approach, applied both to histogram-valued variables and interval- 
valued variables, the independent parameter is a quantile function instead of a real number. 

As a future research perspective, other models and methods in Symbolic Data Analysis based on linear 
relationships between interval-valued variables, such as logistic regression, may now be developed using 
this approach. 
Appendix A. Proof of Proposition 2.7.



Before proving Proposition 2.7 it is necessary to consider two theorems [30] and to write the function
to be optimized in matrix form.

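Theorem .1 Consider the minimization problem in (11). If $b^* = (\alpha^*, \beta^*, \gamma^*)$ is an optimal solution of
this problem, $b^*$ must satisfy the constraints of the optimization problem and the Kuhn-Tucker conditions:

• Constraints: $-\alpha \leq 0$ and $-\beta \leq 0$

• Kuhn-Tucker conditions:

1. $\dfrac{\partial f(b^*)}{\partial \alpha} - \lambda_1 = 0$

2. $\dfrac{\partial f(b^*)}{\partial \beta} - \lambda_2 = 0$

3. $\dfrac{\partial f(b^*)}{\partial \gamma} = 0$

4. $\lambda_1 \alpha^* = 0$

5. $\lambda_2 \beta^* = 0$

6. $\lambda_1, \lambda_2 \geq 0$.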





Theorem .2 Consider the minimization problem in (11). If $f(\alpha, \beta, \gamma)$, $g_1(\alpha, \beta, \gamma)$ and $g_2(\alpha, \beta, \gamma)$ are
convex functions, then any vector that satisfies the hypotheses of Theorem .1 is an optimal solution of the
optimization problem in (11).



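The function

\[
f(\alpha, \beta, \gamma) = \sum_{j=1}^{m} \left[ \left( c_{Y(j)} - (\alpha - \beta)\, c_{X(j)} - \gamma \right)^2 + \frac{1}{3} \left( r_{Y(j)} - (\alpha + \beta)\, r_{X(j)} \right)^2 \right]
\]

to be optimized in problem (11) may be rewritten as follows:

\[
f(\alpha, \beta, \gamma) = b^T H b + q^T b + d \tag{13}
\]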



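where the matrices and vectors involved are the following:

• H is the Hessian matrix, a symmetric matrix of order 3,

\[
H = \begin{bmatrix}
\sum_{j=1}^{m} c_{X(j)}^2 + \frac{1}{3} \sum_{j=1}^{m} r_{X(j)}^2 & -\sum_{j=1}^{m} c_{X(j)}^2 + \frac{1}{3} \sum_{j=1}^{m} r_{X(j)}^2 & \sum_{j=1}^{m} c_{X(j)} \\
-\sum_{j=1}^{m} c_{X(j)}^2 + \frac{1}{3} \sum_{j=1}^{m} r_{X(j)}^2 & \sum_{j=1}^{m} c_{X(j)}^2 + \frac{1}{3} \sum_{j=1}^{m} r_{X(j)}^2 & -\sum_{j=1}^{m} c_{X(j)} \\
\sum_{j=1}^{m} c_{X(j)} & -\sum_{j=1}^{m} c_{X(j)} & m
\end{bmatrix}
\]

• q is the column vector of independent terms,

\[
q = \begin{bmatrix}
\sum_{j=1}^{m} \left( -2\, c_{Y(j)} c_{X(j)} - \frac{2}{3}\, r_{Y(j)} r_{X(j)} \right) \\
\sum_{j=1}^{m} \left( 2\, c_{Y(j)} c_{X(j)} - \frac{2}{3}\, r_{Y(j)} r_{X(j)} \right) \\
\sum_{j=1}^{m} -2\, c_{Y(j)}
\end{bmatrix}
\]

• b is the column vector of the parameters, $b = [\alpha \;\; \beta \;\; \gamma]^T$

• d is the real value $d = \sum_{j=1}^{m} c_{Y(j)}^2 + \frac{1}{3} \sum_{j=1}^{m} r_{Y(j)}^2$.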
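
The equivalence between the sum form of f and the matrix form (13) can be checked numerically. The sketch below builds H, q and d for one explicative variable from lists of intervals and compares both evaluations; the function names, the example intervals and the parameter vector are hypothetical.

```python
def centers_and_half_ranges(intervals):
    centers = [(low + high) / 2.0 for low, high in intervals]
    half_ranges = [(high - low) / 2.0 for low, high in intervals]
    return centers, half_ranges


def matrix_form(x_intervals, y_intervals):
    """Build H, q and d of expression (13) for one explicative variable."""
    cx, rx = centers_and_half_ranges(x_intervals)
    cy, ry = centers_and_half_ranges(y_intervals)
    m = len(cx)
    s_cc = sum(c * c for c in cx)
    s_rr = sum(r * r for r in rx) / 3.0
    s_c = sum(cx)
    s_cyx = sum(a * b for a, b in zip(cy, cx))
    s_ryx = sum(a * b for a, b in zip(ry, rx)) / 3.0
    H = [[s_cc + s_rr, -s_cc + s_rr, s_c],
         [-s_cc + s_rr, s_cc + s_rr, -s_c],
         [s_c, -s_c, float(m)]]
    q = [-2.0 * s_cyx - 2.0 * s_ryx, 2.0 * s_cyx - 2.0 * s_ryx, -2.0 * sum(cy)]
    d = sum(c * c for c in cy) + sum(r * r for r in ry) / 3.0
    return H, q, d


def f_matrix(H, q, d, b):
    """Evaluate b^T H b + q^T b + d."""
    quad = sum(b[i] * H[i][j] * b[j] for i in range(3) for j in range(3))
    return quad + sum(qi * bi for qi, bi in zip(q, b)) + d


def f_direct(x_intervals, y_intervals, b):
    """Evaluate f(alpha, beta, gamma) directly from its definition as a sum."""
    alpha, beta, gamma = b
    cx, rx = centers_and_half_ranges(x_intervals)
    cy, ry = centers_and_half_ranges(y_intervals)
    return sum((c_y - (alpha - beta) * c_x - gamma) ** 2
               + (r_y - (alpha + beta) * r_x) ** 2 / 3.0
               for c_x, r_x, c_y, r_y in zip(cx, rx, cy, ry))


# Hypothetical intervals and parameter vector, for illustration only.
x_obs = [(0.0, 2.0), (1.0, 5.0), (4.0, 6.0)]
y_obs = [(2.0, 8.0), (5.0, 17.0), (14.0, 20.0)]
b = (0.5, 0.25, 1.0)

H, q, d = matrix_form(x_obs, y_obs)
print(f_matrix(H, q, d, b), f_direct(x_obs, y_obs, b))
```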



Proof of Proposition 2.7:

Proof: Consider the optimization problem in (11) where:

a) the functions $g_1(\alpha, \beta, \gamma)$ and $g_2(\alpha, \beta, \gamma)$ that define the non-negativity constraints are convex, so the
feasible region of the optimization problem is a convex set;

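b) $f(\alpha, \beta, \gamma)$ is a convex function because H is positive semi-definite. Consider the matrix X defined in
Equation (7), but now for only one explicative variable. In this particular case, we have:

\[
X = \begin{bmatrix}
c_{X(1)} & -c_{X(1)} & 1 \\
\vdots & \vdots & \vdots \\
c_{X(m)} & -c_{X(m)} & 1 \\
\frac{1}{\sqrt{3}}\, r_{X(1)} & \frac{1}{\sqrt{3}}\, r_{X(1)} & 0 \\
\vdots & \vdots & \vdots \\
\frac{1}{\sqrt{3}}\, r_{X(m)} & \frac{1}{\sqrt{3}}\, r_{X(m)} & 0
\end{bmatrix}
\]

As $H = X^T X$, H is positive semi-definite.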

c) the intervals of the explicative variable X are not all degenerate ($r_X \neq 0$) or symmetric ($c_X \neq 0$). In
this situation, the columns of X are linearly independent, so H is positive definite and consequently
the function $f(\alpha, \beta, \gamma)$ is strictly convex. When the objective function is strictly convex the optimal
solution is unique.

As the optimization problem in (11) satisfies the conditions of Theorem .2, it is possible to find the
expressions of the parameters for the linear regression model in 2.4. Considering the Kuhn-Tucker conditions
(4) and (5) we have:

\[
\lambda_1 \alpha^* = 0 \,\wedge\, \lambda_2 \beta^* = 0 \;\Leftrightarrow\; (\lambda_1 = 0 \vee \alpha^* = 0) \wedge (\lambda_2 = 0 \vee \beta^* = 0).
\]

So, we may consider four situations. 
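
These four situations can also be enumerated numerically. The sketch below, for one explicative variable, computes the stationary point of each situation, discards the candidates with a negative parameter and keeps the one with the smallest value of f; because f is convex, this candidate is the constrained minimum. The closed-form candidates are obtained by setting the corresponding partial derivatives of f to zero. The function name and the example intervals are hypothetical.

```python
def dsd_fit_one_variable(x_intervals, y_intervals):
    """Minimize f(alpha, beta, gamma) with alpha >= 0 and beta >= 0 by cases."""
    m = len(x_intervals)
    cx = [(low + high) / 2.0 for low, high in x_intervals]
    rx = [(high - low) / 2.0 for low, high in x_intervals]
    cy = [(low + high) / 2.0 for low, high in y_intervals]
    ry = [(high - low) / 2.0 for low, high in y_intervals]
    x_bar, y_bar = sum(cx) / m, sum(cy) / m

    s_cc = sum((c - x_bar) ** 2 for c in cx)
    s_cy = sum((c - x_bar) * (d - y_bar) for c, d in zip(cx, cy))
    s_rr = sum(r * r for r in rx) / 3.0
    s_ry = sum(r * s for r, s in zip(rx, ry)) / 3.0

    def objective(alpha, beta, gamma):
        return sum((d - (alpha - beta) * c - gamma) ** 2
                   + (s - (alpha + beta) * r) ** 2 / 3.0
                   for c, r, d, s in zip(cx, rx, cy, ry))

    candidates = [(0.0, 0.0)]                                    # situation I
    if s_cc + s_rr > 0:
        candidates.append((0.0, (s_ry - s_cy) / (s_cc + s_rr)))  # situation II
        candidates.append(((s_ry + s_cy) / (s_cc + s_rr), 0.0))  # situation III
    if s_cc > 0 and s_rr > 0:
        u, v = s_cy / s_cc, s_ry / s_rr    # u = alpha - beta, v = alpha + beta
        candidates.append(((u + v) / 2.0, (v - u) / 2.0))        # situation IV

    best = None
    for alpha, beta in candidates:
        if alpha < 0 or beta < 0:
            continue  # violates the non-negativity constraints
        gamma = y_bar - (alpha - beta) * x_bar
        value = objective(alpha, beta, gamma)
        if best is None or value < best[0]:
            best = (value, alpha, beta, gamma)
    return best[1], best[2], best[3]


# Hypothetical data in which Y(j) is obtained as 2 + 3 * X(j).
x_obs = [(0.0, 2.0), (1.0, 5.0), (4.0, 6.0)]
y_obs = [(2.0, 8.0), (5.0, 17.0), (14.0, 20.0)]
print(dsd_fit_one_variable(x_obs, y_obs))
```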

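I Suppose $\alpha^* = \beta^* = 0$. The system formed by the Kuhn-Tucker conditions is

\[
\begin{cases}
\dfrac{\partial f(b^*)}{\partial \alpha} - \lambda_1 = 0 \\[6pt]
\dfrac{\partial f(b^*)}{\partial \beta} - \lambda_2 = 0 \\[6pt]
\dfrac{\partial f(b^*)}{\partial \gamma} = 0
\end{cases}
\]

Solving this system we prove that in this situation

\[
\sum_{j=1}^{m} -\frac{1}{3}\, r_{X(j)} r_{Y(j)} = \sum_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right) c_{X(j)}.
\]

So, in this case $\alpha^* = 0$; $\beta^* = 0$ and $\gamma^* = \overline{Y}$.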
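II Suppose $\alpha^* = 0$ and $\lambda_2 = 0$. Considering that in this case $\lambda_1 \geq 0$ and $\beta^* \geq 0$, we have:

\[
\begin{cases}
\dfrac{\partial f(b^*)}{\partial \alpha} - \lambda_1 = 0 \\[6pt]
\dfrac{\partial f(b^*)}{\partial \beta} = 0 \\[6pt]
\dfrac{\partial f(b^*)}{\partial \gamma} = 0
\end{cases}
\]

from which we conclude that:

\[
\beta^* = \frac{\sum_{j=1}^{m} \frac{1}{3}\, r_{X(j)} r_{Y(j)} - \sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right) \left( c_{Y(j)} - \overline{Y} \right)}{\sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2 + \sum_{j=1}^{m} \frac{1}{3}\, r_{X(j)}^2}
\]

\[
\gamma^* = \overline{Y} + \beta^* \overline{X}
\]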



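III Suppose $\lambda_1 = 0$ and $\beta^* = 0$. Considering that in this case $\lambda_2 \geq 0$ and $\alpha^* \geq 0$, we have:

\[
\begin{cases}
\dfrac{\partial f(b^*)}{\partial \alpha} = 0 \\[6pt]
\dfrac{\partial f(b^*)}{\partial \beta} - \lambda_2 = 0 \\[6pt]
\dfrac{\partial f(b^*)}{\partial \gamma} = 0
\end{cases}
\]

from which we conclude that:

\[
\alpha^* = \frac{\sum_{j=1}^{m} \frac{1}{3}\, r_{X(j)} r_{Y(j)} + \sum_{j=1}^{m} \left( c_{Y(j)} - \overline{Y} \right) \left( c_{X(j)} - \overline{X} \right)}{\sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2 + \sum_{j=1}^{m} \frac{1}{3}\, r_{X(j)}^2}
\]

\[
\gamma^* = \overline{Y} - \alpha^* \overline{X}
\]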



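IV Suppose $\lambda_1 = 0$ and $\lambda_2 = 0$. Then,

\[
\begin{cases}
\dfrac{\partial f(b^*)}{\partial \alpha} = 0 \\[6pt]
\dfrac{\partial f(b^*)}{\partial \beta} = 0 \\[6pt]
\dfrac{\partial f(b^*)}{\partial \gamma} = 0
\end{cases}
\]

From this system we conclude that:

\[
\alpha^* = \frac{\sum_{j=1}^{m} r_{X(j)} r_{Y(j)} \sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2 + \sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right) \left( c_{Y(j)} - \overline{Y} \right) \sum_{j=1}^{m} r_{X(j)}^2}{2 \sum_{j=1}^{m} r_{X(j)}^2 \sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2}
\]

\[
\beta^* = \frac{\sum_{j=1}^{m} r_{X(j)} r_{Y(j)} \sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2 - \sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right) \left( c_{Y(j)} - \overline{Y} \right) \sum_{j=1}^{m} r_{X(j)}^2}{2 \sum_{j=1}^{m} r_{X(j)}^2 \sum_{j=1}^{m} \left( c_{X(j)} - \overline{X} \right)^2}
\]

\[
\gamma^* = \overline{Y} - (\alpha^* - \beta^*)\, \overline{X}
\]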



As in this case $\alpha^* \geq 0$ and $\beta^* \geq 0$, the expressions of $\alpha^*$ and $\beta^*$ are non-negative only if



m ^ m m m ^ 



j=i 



j=i 



j=i 



i=i 



r-o") ° 